The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether a client will repay a loan. To ensure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.
Kaggle is a data science competition platform that hosts many datasets. In the past, submitting your results was troublesome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.:
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
Make sure your kaggle.json API token is in the right place (~/.kaggle/kaggle.json). For more detailed information on setting up the Kaggle API, see here and here.
Team name: FP_Group22
Team members:
Phase leadership plan:
Credit assignment plan:
!pip install kaggle
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: kaggle in /usr/local/lib/python3.9/dist-packages (1.5.13) Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.16.0) Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle) (2022.12.7) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.8.2) Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle) (1.26.15) Requirement already satisfied: python-slugify in /usr/local/lib/python3.9/dist-packages (from kaggle) (8.0.1) Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from kaggle) (4.65.0) Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle) (2.27.1) Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (3.4) Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle) (2.0.12)
!pwd
/content
!ls -l kaggle.json
-rw-r--r-- 1 root root 69 Apr 26 01:44 kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
name size creationDate ---------------------------------- ----- ------------------- installments_payments.csv 690MB 2019-12-11 02:55:35 bureau_balance.csv 358MB 2019-12-11 02:55:35 bureau.csv 162MB 2019-12-11 02:55:35 application_test.csv 25MB 2019-12-11 02:55:35 HomeCredit_columns_description.csv 37KB 2019-12-11 02:55:35 POS_CASH_balance.csv 375MB 2019-12-11 02:55:35 application_train.csv 158MB 2019-12-11 02:55:35 credit_card_balance.csv 405MB 2019-12-11 02:55:35 sample_submission.csv 524KB 2019-12-11 02:55:35 previous_application.csv 386MB 2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would either be unable to obtain loans or would become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
The HomeCredit_columns_description.csv file acts as a data dictionary.
There are 7 different sources of data:
name                    [        rows, cols]  MegaBytes
----------------------- --------------------  ---------
application_train       [    307,511,   122]  158MB
application_test        [     48,744,   121]   25MB
bureau                  [  1,716,428,    17]  162MB
bureau_balance          [ 27,299,925,     3]  358MB
credit_card_balance     [  3,840,312,    23]  405MB
installments_payments   [ 13,605,401,     8]  690MB
previous_application    [  1,670,214,    37]  386MB
POS_CASH_balance        [ 10,001,358,     8]  375MB
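The child tables link back to the application tables through shared keys: `SK_ID_CURR` joins `application_train`/`application_test` to `bureau`, `previous_application`, and the balance/payment tables, while `SK_ID_BUREAU` and `SK_ID_PREV` link the bureau and previous-application histories to their monthly records. A minimal sketch of the usual aggregate-then-merge pattern, using tiny toy frames in place of the real `application_train` and `bureau` tables (which are loaded later in this notebook):

```python
import pandas as pd

# Toy stand-ins for application_train and bureau; the real frames have the
# same key column SK_ID_CURR.
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003], "TARGET": [1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "AMT_CREDIT_SUM": [50000.0, 20000.0, 10000.0],
})

# Collapse the child table to one row per client, then left-join the
# aggregates onto the application table on the shared key.
bureau_agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
                    .agg(["count", "mean"])
                    .add_prefix("BUREAU_")
                    .reset_index())
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The left join keeps every application row even when a client has no bureau history; those aggregate columns simply come back as NaN.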

Create a base directory:
DATA_DIR = "../data/home-credit-default-risk" # at the same level as the course repo, in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
- Use the Download button on the competition's Data webpage and unzip the zip file to DATA_DIR.
- Use the Kaggle CLI, as shown below.
!mkdir -p ../data/home-credit-default-risk
!ls -l {DATA_DIR}
total 0
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
Downloading home-credit-default-risk.zip to ../data/home-credit-default-risk 100% 688M/688M [00:23<00:00, 34.1MB/s] 100% 688M/688M [00:23<00:00, 31.1MB/s]
!pwd
/content
!ls -l $DATA_DIR
total 704708 -rw-r--r-- 1 root root 721616255 Apr 26 01:45 home-credit-default-risk.zip
#!rm -r $DATA_DIR  # uncomment to clean up the data directory
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = True  # set to False if the archive is already extracted
if unzippingReq:
    # extractall() extracts all members of the archive; the path argument
    # specifies the directory to extract to
    with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall('../data/home-credit-default-risk')

!ls -l ../data/home-credit-default-risk/application_train.csv
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
DATA_DIR=f"../data/home-credit-default-risk/"
pd.set_option('display.max_columns', None)
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | 
FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
(307511, 122)
The training data has 307,511 observations, each row representing one loan, with a TARGET feature (0: loan repaid, 1: loan not repaid) along with 121 other features.
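Because repaid loans far outnumber defaults, it is worth quantifying the TARGET class balance before modelling. A minimal sketch on a toy series; in the notebook the same call would run on `datasets['application_train']['TARGET']`:

```python
import pandas as pd

# Toy TARGET column with an 8% positive rate, standing in for
# datasets['application_train']['TARGET'].
target = pd.Series([0] * 92 + [1] * 8)

counts = target.value_counts()                 # absolute count per class
ratios = target.value_counts(normalize=True)   # proportion per class
print(counts.to_dict())
print(ratios.to_dict())
```

A strong imbalance like this is one reason the competition is scored with ROC AUC rather than plain accuracy.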
#COLUMNS PRESENT IN THE DATA
print(datasets['application_train'].columns)
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY',
...
'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype='object', length=122)
DATA_DIR
'../data/home-credit-default-risk/'
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | 
FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.0660 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | 0.0672 | 0.0612 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0526 | NaN | NaN | 0.0666 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -1866 | -2000.0 | -4208 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Sales staff | 4.0 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.3052 | 0.1974 | 0.9970 | 0.9592 | 0.1165 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2042 | 0.2404 | 0.3673 | 0.0386 | 0.08 | 0.3109 | 0.2049 | 0.9970 | 0.9608 | 0.1176 | 0.3222 | 0.2759 | 0.375 | 0.0417 | 0.2089 | 0.2626 | 0.3827 | 0.0389 | 0.0847 | 0.3081 | 0.1974 | 0.9970 | 0.9597 | 0.1173 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2078 | 0.2446 | 0.3739 | 0.0388 | 0.0817 | reg oper account | block of flats | 0.3700 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -1805.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -2191 | -4000.0 | -4262 | 16.0 | 1 | 1 | 1 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -821.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
The application dataset has the most information about the client: gender, income, family status, education, and so on.
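As the head() output above shows, many of the building-related columns are mostly NaN, so a per-column missing-value profile is a useful first EDA step. A sketch on a toy frame, assuming the real check would run on `datasets['application_train']`:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for datasets['application_train'].
df = pd.DataFrame({
    "EXT_SOURCE_1": [0.08, np.nan, np.nan, 0.52],
    "AMT_CREDIT":   [406597.5, 1293502.5, 135000.0, 312682.5],
})

# Percentage of missing values per column, worst first: isna() gives a
# boolean mask, and its column-wise mean is the fraction of NaNs.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns with very high missing percentages are candidates for dropping or for imputation via the SimpleImputer already imported above.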
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance",
            "credit_card_balance", "installments_payments",
            "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | 0.752614 | 0.789654 | 0.159520 | 0.0660 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0505 | NaN | NaN | 0.0672 | 0.0612 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0526 | NaN | NaN | 0.0666 | 0.0590 | 0.9732 | NaN | NaN | NaN | 0.1379 | 0.125 | NaN | NaN | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | 0.564990 | 0.291656 | 0.432962 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | NaN | 0.699787 | 0.610991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -1866 | -2000.0 | -4208 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Sales staff | 4.0 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.525734 | 0.509677 | 0.612704 | 0.3052 | 0.1974 | 0.9970 | 0.9592 | 0.1165 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2042 | 0.2404 | 0.3673 | 0.0386 | 0.08 | 0.3109 | 0.2049 | 0.9970 | 0.9608 | 0.1176 | 0.3222 | 0.2759 | 0.375 | 0.0417 | 0.2089 | 0.2626 | 0.3827 | 0.0389 | 0.0847 | 0.3081 | 0.1974 | 0.9970 | 0.9597 | 0.1173 | 0.32 | 0.2759 | 0.375 | 0.0417 | 0.2078 | 0.2446 | 0.3739 | 0.0388 | 0.0817 | reg oper account | block of flats | 0.3700 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -1805.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -2191 | -4000.0 | -4262 | 16.0 | 1 | 1 | 1 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.202145 | 0.425687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -821.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB None
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
POS_CASH_balance: shape is (10001358, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 35 s, sys: 7.56 s, total: 42.5 s Wall time: 43.4 s
for ds_name, ds in datasets.items():
    print(f'dataset {ds_name:24}: [ {ds.shape[0]:10,}, {ds.shape[1]}]')
dataset application_train : [ 307,511, 122] dataset application_test : [ 48,744, 121] dataset bureau : [ 1,716,428, 17] dataset bureau_balance : [ 27,299,925, 3] dataset credit_card_balance : [ 3,840,312, 23] dataset installments_payments : [ 13,605,401, 8] dataset previous_application : [ 1,670,214, 37] dataset POS_CASH_balance : [ 10,001,358, 8]
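The shapes above show one row per applicant in `application_train`/`application_test` (keyed by `SK_ID_CURR`) and many rows per applicant in the auxiliary tables (e.g. `bureau`, which also carries `SK_ID_BUREAU`). Before those tables can feed a model, the one-to-many side has to be aggregated to one row per `SK_ID_CURR` and left-joined back. A minimal sketch with toy data (column names follow the schemas printed above; the `count`/`sum` aggregation and the `BUREAU_CREDIT_` prefix are illustrative choices, not the project's final feature set):

```python
import pandas as pd

# Toy stand-ins for the real tables loaded above.
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003],
                    "AMT_CREDIT": [406597.5, 1293502.5]})
bureau = pd.DataFrame({"SK_ID_CURR": [100002, 100002, 100003],
                       "SK_ID_BUREAU": [5714462, 5714463, 5714464],
                       "AMT_CREDIT_SUM": [91323.0, 225000.0, 464323.5]})

# Collapse the one-to-many bureau table to one row per applicant,
# then left-join the aggregates onto the application table.
bureau_agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
                    .agg(["count", "sum"])
                    .add_prefix("BUREAU_CREDIT_")
                    .reset_index())
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

A left join keeps every applicant even when no bureau records exist for them; the resulting `NaN` aggregates then surface naturally during the missing-value analysis below.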
This step summarizes the data with statistical and visualization techniques, with the goal of surfacing the key features so the data can be cleaned before training.
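A natural first summary is the extent of missingness, since many of the building-characteristic columns in the previews above are largely `NaN`. The helper below is a small sketch (the function name is ours); it would be applied to the `datasets` dict loaded above, and is demonstrated here on a toy frame:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst columns first."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100 * n_missing / len(df),
    })
    # Keep only columns that actually have gaps, sorted worst-first.
    return summary[summary["n_missing"] > 0].sort_values("pct_missing",
                                                         ascending=False)

# e.g. missing_summary(datasets["application_train"]).head(10)
demo = missing_summary(pd.DataFrame({"a": [1.0, None, None, 4.0],
                                     "b": [1, 2, 3, 4]}))
print(demo)
```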
# COLUMNS INFO
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
datasets["application_train"].describe()  # numeric features only
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307509.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 159080.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 2.152665 | 2.052463 | 2.031521 | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | 0.114231 | 0.087543 | 0.977065 | 0.759637 | 0.042553 | 0.074490 | 0.145193 | 0.222315 | 0.228058 | 0.064958 | 0.105645 | 0.105975 | 0.008076 | 0.027022 | 0.117850 | 0.087955 | 0.977752 | 0.755746 | 0.044595 | 0.078078 | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | 0.102547 | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.910682 | 0.509034 | 0.502737 | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | 0.161380 | 0.081184 | 0.092576 | 0.110565 | 0.047732 | 0.069523 | 0.107936 | 0.084307 | 0.064575 | 0.110111 | 0.074445 | 0.132256 | 0.100977 | 0.143709 | 0.161160 | 0.081750 | 0.097880 | 0.111845 | 0.046276 | 0.070254 | 0.109076 | 0.082179 | 0.059897 | 0.112066 | 0.076144 | 0.134467 | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | 0.107462 | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.050400 | 0.045300 | 0.000000 | 0.000000 | 0.052500 | 0.040700 | 0.976700 | 0.699400 | 0.007200 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016600 | 0.054200 | 0.042700 | 0.000000 | 0.000000 | 0.058300 | 0.043700 | 0.976700 | 0.691400 | 0.007900 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | 0.041200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048100 | 0.075600 | 0.074500 | 0.000000 | 0.003600 | 0.084000 | 0.074600 | 0.981600 | 0.764800 | 0.019000 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.045800 | 0.077100 | 0.073100 | 0.000000 | 0.001100 | 0.086400 | 0.075800 | 0.981600 | 0.758500 | 0.020800 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | 0.068800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.121000 | 0.129900 | 0.003900 | 0.027700 | 0.143900 | 0.112400 | 0.986600 | 0.823600 | 0.049000 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.084100 | 0.131300 | 0.125200 | 0.003900 | 0.023100 | 0.148900 | 0.111600 | 0.986600 | 0.825600 | 0.051300 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | 0.127600 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 20.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
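One anomaly jumps out of the summary above: `DAYS_EMPLOYED` has a maximum of 365243 (roughly 1,000 years) while every other value in the column is negative (days before the application). This looks like a placeholder for "not employed / unknown". A common, hedged way to handle it — sketched here with a helper name of our own — is to record a flag and then replace the sentinel with `NaN` so it cannot be mistaken for a real employment duration:

```python
import numpy as np
import pandas as pd

DAYS_EMPLOYED_SENTINEL = 365243  # suspicious max observed in DAYS_EMPLOYED

def flag_days_employed(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows carrying the sentinel, then null the sentinel out."""
    df = df.copy()
    df["DAYS_EMPLOYED_ANOM"] = df["DAYS_EMPLOYED"] == DAYS_EMPLOYED_SENTINEL
    df["DAYS_EMPLOYED"] = df["DAYS_EMPLOYED"].replace(DAYS_EMPLOYED_SENTINEL,
                                                      np.nan)
    return df

# Toy demonstration; in the notebook this would be applied to
# datasets["application_train"] and datasets["application_test"].
demo = flag_days_employed(pd.DataFrame({"DAYS_EMPLOYED": [-1213, 365243]}))
```

Keeping the boolean flag preserves the information that the value was anomalous, which can itself be predictive, while the `NaN` is handled by whatever imputation the modeling pipeline uses.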
datasets["application_test"].describe()  # summary statistics for numerical features only
| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 16432.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 28212.000000 | 48736.000000 | 40076.000000 | 24857.000000 | 21103.000000 | 25888.000000 | 16926.000000 | 15249.000000 | 23555.000000 | 25165.000000 | 25423.000000 | 16278.000000 | 20490.000000 | 15964.000000 | 25192.000000 | 15397.000000 | 22660.000000 | 24857.000000 | 21103.000000 | 25888.000000 | 16926.000000 | 15249.000000 | 23555.000000 | 25165.000000 | 25423.000000 | 16278.000000 | 20490.000000 | 15964.000000 | 25192.000000 | 15397.000000 | 22660.000000 | 24857.000000 | 21103.000000 | 25888.000000 | 16926.000000 | 15249.000000 | 23555.000000 | 25165.000000 | 25423.000000 | 16278.000000 | 20490.000000 | 15964.000000 | 25192.000000 | 15397.000000 | 22660.000000 | 26120.000000 | 48715.000000 | 48715.000000 | 48715.000000 | 48715.000000 | 48744.000000 | 48744.0 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.0 | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | -3051.712949 | 11.786027 | 0.999979 | 0.809720 | 0.204702 | 0.998400 | 0.263130 | 0.162646 | 2.146767 | 2.038159 | 2.012596 | 12.007365 | 0.018833 | 0.055166 | 0.042036 | 0.077466 | 0.224664 | 0.174216 | 0.501180 | 0.518021 | 0.500106 | 0.122388 | 0.090065 | 0.978828 | 0.751137 | 0.047624 | 0.085168 | 0.151777 | 0.233706 | 0.238423 | 0.067192 | 0.105885 | 0.112286 | 0.009231 | 0.029387 | 0.119078 | 0.088998 | 0.978292 | 0.758327 | 0.045223 | 0.080570 | 0.147161 | 0.229390 | 0.233854 | 0.065914 | 0.110874 | 0.110687 | 0.008358 | 0.028161 | 0.122809 | 0.089529 | 0.978822 | 0.754344 | 0.047420 | 0.084128 | 0.151200 | 0.233154 | 0.237846 | 0.068069 | 0.107063 | 0.113368 | 0.008979 | 0.029296 | 0.107129 | 1.447644 | 0.143652 | 1.435738 | 0.101139 | -1077.766228 | 0.0 | 0.786620 | 0.000103 | 0.014751 | 0.087477 | 0.000041 | 0.088462 | 0.004493 | 0.0 | 0.001169 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | 1569.276709 | 11.462889 | 0.004529 | 0.392526 | 0.403488 | 0.039971 | 0.440337 | 0.369046 | 0.890423 | 0.522694 | 0.515804 | 3.278172 | 0.135937 | 0.228306 | 0.200673 | 0.267332 | 0.417365 | 0.379299 | 0.205142 | 0.181278 | 0.189498 | 0.113112 | 0.081536 | 0.049318 | 0.113188 | 0.082868 | 0.139164 | 0.100669 | 0.147361 | 0.164976 | 0.081909 | 0.098284 | 0.114860 | 0.048749 | 0.072007 | 0.113465 | 0.082655 | 0.053782 | 0.110117 | 0.081169 | 0.137509 | 0.101748 | 0.146485 | 0.165034 | 0.082880 | 0.103980 | 0.116699 | 0.046657 | 0.073504 | 0.114184 | 0.081022 | 0.049663 | 0.111998 | 0.082892 | 0.139014 | 0.100931 | 0.147629 | 0.165241 | 0.082869 | 0.099737 | 0.116503 | 0.048148 | 0.072998 | 0.111420 | 3.608053 | 0.514413 | 3.580125 | 0.403791 | 878.920740 | 0.0 | 0.409698 | 0.010128 | 0.120554 | 0.282536 | 0.006405 | 0.283969 | 0.066879 | 0.0 | 0.034176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | -6348.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013458 | 0.000008 | 0.000527 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4361.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | -4448.000000 | 4.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.343695 | 0.408066 | 0.363945 | 0.061900 | 0.046700 | 0.976700 | 0.687200 | 0.008100 | 0.000000 | 0.074500 | 0.166700 | 0.104200 | 0.019000 | 0.050400 | 0.048575 | 0.000000 | 0.000000 | 0.058800 | 0.042500 | 0.976200 | 0.692900 | 0.007600 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016525 | 0.055100 | 0.045600 | 0.000000 | 0.000000 | 0.062500 | 0.046150 | 0.976700 | 0.691400 | 0.008000 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.019000 | 0.051300 | 0.049000 | 0.000000 | 0.000000 | 0.043200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1766.250000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | -3234.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.506771 | 0.558758 | 0.519097 | 0.092800 | 0.078100 | 0.981600 | 0.755200 | 0.022700 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048300 | 0.075600 | 0.077000 | 0.000000 | 0.003800 | 0.085100 | 0.077000 | 0.981600 | 0.758300 | 0.020300 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.046200 | 0.081700 | 0.075100 | 0.000000 | 0.001200 | 0.092600 | 0.077800 | 0.981600 | 0.758500 | 0.022300 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048800 | 0.077000 | 0.077600 | 0.000000 | 0.003100 | 0.070700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -863.000000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | -1706.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.665956 | 0.658497 | 0.652897 | 0.148500 | 0.113400 | 0.986600 | 0.816400 | 0.053900 | 0.160000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.126900 | 0.137600 | 0.005100 | 0.029000 | 0.150200 | 0.113550 | 0.986600 | 0.823600 | 0.051700 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.132200 | 0.130600 | 0.003900 | 0.024500 | 0.149900 | 0.113000 | 0.986600 | 0.818900 | 0.053800 | 0.160000 | 0.206900 | 0.333300 | 0.375000 | 0.088000 | 0.126600 | 0.137425 | 0.003900 | 0.028025 | 0.135700 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -363.000000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | 0.000000 | 74.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 21.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.939145 | 0.855000 | 0.882530 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 354.000000 | 34.000000 | 351.000000 | 24.000000 | 0.000000 | 0.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
datasets["application_train"].describe(include='all')  # summarize both categorical and numerical features
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 306219 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 211120 | 307509.000000 | 307511.000000 | 307511.000000 | 307511 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 151450.000000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 97216 | 153214 | 159080.000000 | 151170 | 161756 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | NaN | 7 | 8 | 5 | 6 | 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18 | NaN | NaN | NaN | 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 3 | NaN | 7 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | NaN | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | NaN | NaN | NaN | TUESDAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Business Entity Type 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | reg oper account | block of flats | NaN | Panel | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | NaN | 248526 | 158774 | 218391 | 196432 | 272868 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 55186 | NaN | NaN | NaN | 53901 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 67992 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 73830 | 150503 | NaN | 66040 | 159428 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | NaN | NaN | NaN | NaN | NaN | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | NaN | 2.152665 | 2.052463 | 2.031521 | NaN | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | NaN | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | 0.231894 | 0.066333 | 0.100775 | 0.107399 | 0.008809 | 0.028358 | 0.114231 | 0.087543 | 0.977065 | 0.759637 | 0.042553 | 0.074490 | 0.145193 | 0.222315 | 0.228058 | 0.064958 | 0.105645 | 0.105975 | 0.008076 | 0.027022 | 0.117850 | 0.087955 | 0.977752 | 0.755746 | 0.044595 | 0.078078 | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | NaN | NaN | 0.102547 | NaN | NaN | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | NaN | NaN | NaN | NaN | NaN | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | NaN | 0.910682 | 0.509034 | 0.502737 | NaN | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | NaN | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | 0.161380 | 0.081184 | 0.092576 | 0.110565 | 0.047732 | 0.069523 | 0.107936 | 0.084307 | 0.064575 | 0.110111 | 0.074445 | 0.132256 | 0.100977 | 0.143709 | 0.161160 | 0.081750 | 0.097880 | 0.111845 | 0.046276 | 0.070254 | 0.109076 | 0.082179 | 0.059897 | 0.112066 | 0.076144 | 0.134467 | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | NaN | NaN | 0.107462 | NaN | NaN | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | NaN | NaN | NaN | NaN | NaN | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | NaN | NaN | NaN | NaN | NaN | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | 2.000000 | 2.000000 | NaN | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.050400 | 0.045300 | 0.000000 | 0.000000 | 0.052500 | 0.040700 | 0.976700 | 0.699400 | 0.007200 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.016600 | 0.054200 | 0.042700 | 0.000000 | 0.000000 | 0.058300 | 0.043700 | 0.976700 | 0.691400 | 0.007900 | 0.000000 | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | NaN | NaN | 0.041200 | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | NaN | NaN | NaN | NaN | NaN | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | 2.000000 | 2.000000 | NaN | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048100 | 0.075600 | 0.074500 | 0.000000 | 0.003600 | 0.084000 | 0.074600 | 0.981600 | 0.764800 | 0.019000 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.045800 | 0.077100 | 0.073100 | 0.000000 | 0.001100 | 0.086400 | 0.075800 | 0.981600 | 0.758500 | 0.020800 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | NaN | NaN | 0.068800 | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | NaN | NaN | NaN | NaN | NaN | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | NaN | 3.000000 | 2.000000 | 2.000000 | NaN | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.085600 | 0.121000 | 0.129900 | 0.003900 | 0.027700 | 0.143900 | 0.112400 | 0.986600 | 0.823600 | 0.049000 | 0.120800 | 0.206900 | 0.333300 | 0.375000 | 0.084100 | 0.131300 | 0.125200 | 0.003900 | 0.023100 | 0.148900 | 0.111600 | 0.986600 | 0.825600 | 0.051300 | 0.120000 | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | NaN | NaN | 0.127600 | NaN | NaN | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | NaN | NaN | NaN | NaN | NaN | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 20.000000 | 3.000000 | 3.000000 | NaN | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | NaN | 1.000000 | NaN | NaN | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
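By default `describe()` summarizes only numeric columns, while `include='all'` also reports `unique`/`top`/`freq` for the categorical ones. When the two groups need to be handled separately (e.g. for encoding), they can be split by dtype. A minimal sketch on a toy stand-in frame (the column values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for one of the application tables.
df = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "AMT_CREDIT": [406597.5, 1293502.5],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
})

# Split column names by dtype: object columns are the categorical ones,
# numeric columns are what plain describe() summarizes.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
```

On the real `application_train` table this split recovers the 16 object columns (NAME_CONTRACT_TYPE, CODE_GENDER, ...) seen in the `unique` row above.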
percent = (datasets["application_train"].isna().sum() / len(datasets["application_train"]) * 100).sort_values(ascending=False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending=False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=["Percent", "Train Missing Count"])
missing_application_train_data.head(20)
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
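Since the same percent/count summary is needed again for the test split below, the computation can be wrapped in a small helper. A sketch; `missing_summary` is a name introduced here, not part of the notebook:

```python
import pandas as pd

def missing_summary(df, count_label="Missing Count"):
    """Per-column missing count and percentage, sorted by count descending."""
    counts = df.isna().sum().sort_values(ascending=False)
    percent = (counts / len(df) * 100).round(2)
    return pd.concat([percent, counts], axis=1, keys=["Percent", count_label])

# Tiny example with known gaps: column "a" has 2 of 4 values missing.
toy = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4]})
summary = missing_summary(toy)
```

Calling `missing_summary(datasets["application_train"], "Train Missing Count")` would reproduce the table above.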
missinga5 = missing_application_train_data.loc[missing_application_train_data['Percent'] > 55]
missinga5
| | Percent | Train Missing Count |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_MEDI | 58.52 | 179943 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| BASEMENTAREA_MODE | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| NONLIVINGAREA_MODE | 55.18 | 169682 |
| NONLIVINGAREA_AVG | 55.18 | 169682 |
| NONLIVINGAREA_MEDI | 55.18 | 169682 |
Of the columns above with more than 55% of their values missing, the three NONLIVINGAPARTMENTS columns below are the least correlated with TARGET, so we drop them from the dataset entirely.
corr_nonliving_mode = datasets["application_train"]['NONLIVINGAPARTMENTS_MODE'].corr(datasets["application_train"]['TARGET'])
corr_nonliving_mode
-0.0015565608459037438
corr_nonliving_avg = datasets["application_train"]['NONLIVINGAPARTMENTS_AVG'].corr(datasets["application_train"]['TARGET'])
corr_nonliving_avg
-0.0031761075207261394
corr_nonliving_medi = datasets["application_train"]['NONLIVINGAPARTMENTS_MEDI'].corr(datasets["application_train"]['TARGET'])
corr_nonliving_medi
-0.0027571486021083823
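All three correlations are below 0.005 in absolute value. The notebook does not show the drop call itself, so the following is one way it could be done, sketched on a tiny hypothetical stand-in for `datasets["application_train"]`:

```python
import pandas as pd

# Hypothetical stand-in frame; in the notebook the real application_train
# DataFrame would be used instead.
df = pd.DataFrame({
    "NONLIVINGAPARTMENTS_MODE": [0.0, 0.1],
    "NONLIVINGAPARTMENTS_AVG": [0.0, 0.2],
    "NONLIVINGAPARTMENTS_MEDI": [0.0, 0.3],
    "TARGET": [0, 1],
})

# Drop the three columns that are both mostly missing and barely
# correlated with the target.
low_corr_cols = [
    "NONLIVINGAPARTMENTS_MODE",
    "NONLIVINGAPARTMENTS_AVG",
    "NONLIVINGAPARTMENTS_MEDI",
]
df = df.drop(columns=low_corr_cols)
```

The same column list would be dropped from `application_test` as well, so the two splits keep identical schemas.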
import seaborn as sns
import matplotlib.pyplot as plt
# Filter rows with count of missing values greater than 0
missing_application_train_data = missing_application_train_data[missing_application_train_data['Train Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_application_train_data))
missing_more_than_60 = missing_application_train_data[missing_application_train_data['Percent'] >= 60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))
plt.figure(figsize=(15, 25))
sns.set_style("whitegrid")
ax = sns.barplot(x="Train Missing Count", y=missing_application_train_data.index, data=missing_application_train_data,
palette="Blues_d")
# Set the figure size and axis labels
ax.set_xlabel("Number of Missing Values")
ax.set_ylabel("Features")
plt.title('Missing values in Train columns')
# Add values to the bars
for i, v in enumerate(missing_application_train_data['Train Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')
# Show the plot
plt.show()
Number of columns with missing value greater than 0: 67 Number of columns with more than 60% missing values: 17
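The percent/count computation above is repeated almost verbatim for each table later in the notebook; it could be factored into a small helper. A sketch of that refactoring (the `missing_report` name is ours, not from the source):

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Percent and count of missing values per column, sorted by percent descending."""
    percent = (df.isnull().mean() * 100).round(2)
    count = df.isnull().sum()
    report = pd.concat([percent, count], axis=1, keys=["Percent", "Missing Count"])
    return report.sort_values("Percent", ascending=False)

# Tiny demonstration frame: column "a" is half missing, "b" is complete
demo = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4]})
report = missing_report(demo)
print(report)
```

The same helper could then be called as `missing_report(datasets["application_test"])`, `missing_report(datasets["bureau"])`, and so on, instead of repeating the three-line pattern.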
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| Percent | Test Missing Count | |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
# Filter rows with count of missing values greater than 0
missing_application_test_data = missing_application_test_data[missing_application_test_data['Test Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_application_test_data))
missing_more_than_60=missing_application_test_data[missing_application_test_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))
plt.figure(figsize=(15, 25))
sns.set_style("whitegrid")
ax = sns.barplot(x="Test Missing Count", y=missing_application_test_data.index, data=missing_application_test_data,
palette="rocket")
# Set the figure size and axis labels
ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")
plt.title('Missing values in Test columns')
# Add values to the bars
for i, v in enumerate(missing_application_test_data['Test Missing Count']):
    ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')
# Show the plot
plt.show()
Number of columns with missing value greater than 0: 64 Number of columns with more than 60% missing values: 17
datasets["application_train"] = datasets["application_train"].drop(['NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI'], axis=1)
datasets["application_train"].shape
(307511, 119)
datasets["application_test"] = datasets["application_test"].drop(['NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI'], axis=1)
datasets["application_test"].shape
(48744, 118)
As the shapes above show, we have dropped the three low-correlation columns from both the train and test datasets.
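Since the same columns must be removed from train and test to keep the feature sets aligned, the two drop calls above could be done in one pass. A hedged sketch (the `drop_columns` helper is hypothetical, not part of the source):

```python
import pandas as pd

def drop_columns(frames: dict, cols: list) -> dict:
    """Drop the given columns from every frame; ignore columns that are absent."""
    return {name: df.drop(columns=cols, errors="ignore") for name, df in frames.items()}

# Tiny demonstration frames standing in for the train/test pair
frames = {
    "train": pd.DataFrame({"keep": [1], "NONLIVINGAPARTMENTS_MODE": [0.0]}),
    "test": pd.DataFrame({"keep": [2]}),  # column already absent here
}
frames = drop_columns(frames, ["NONLIVINGAPARTMENTS_MODE"])
print({name: list(df.columns) for name, df in frames.items()})
```

Using `errors="ignore"` makes the call safe when a column exists in one split but not the other.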
# Create a bar plot of the distribution of the target column
sns.set(style='darkgrid')
ax = sns.countplot(x='TARGET', data=datasets['application_train'])
# Set the plot title and axis labels
plt.title('Distribution of Target Column')
plt.xlabel('TARGET')
plt.ylabel('Count')
# Set the tick labels for the x-axis
plt.xticks([0, 1], ['Loan Repaid', 'Loan Not Repaid'])
# Add percentage values on top of the bars
total = float(len(datasets['application_train']))
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center', fontsize=12, color='black',
                xytext=(0, 5), textcoords='offset points')
# Show the plot
plt.show()
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations: FLAG_DOCUMENT_3 0.044346 REG_CITY_NOT_LIVE_CITY 0.044395 FLAG_EMP_PHONE 0.045982 REG_CITY_NOT_WORK_CITY 0.050994 DAYS_ID_PUBLISH 0.051457 DAYS_LAST_PHONE_CHANGE 0.055218 REGION_RATING_CLIENT 0.058899 REGION_RATING_CLIENT_W_CITY 0.060893 DAYS_BIRTH 0.078239 TARGET 1.000000 Name: TARGET, dtype: float64 Most Negative Correlations: EXT_SOURCE_3 -0.178919 EXT_SOURCE_2 -0.160472 EXT_SOURCE_1 -0.155317 DAYS_EMPLOYED -0.044932 FLOORSMAX_AVG -0.044003 FLOORSMAX_MEDI -0.043768 FLOORSMAX_MODE -0.043226 AMT_GOODS_PRICE -0.039645 REGION_POPULATION_RELATIVE -0.037227 ELEVATORS_AVG -0.034199 Name: TARGET, dtype: float64
correlations = datasets["application_train"].corr()
sns.set_theme(style="white")
sns.heatmap(correlations)
<Axes: >
# First pie chart - loan data by gender
fig, ax = plt.subplots(figsize=(6, 6))
gender_counts = datasets["application_train"]['CODE_GENDER'].value_counts()
gender_labels = gender_counts.index
gender_sizes = gender_counts.values
colors = ["#497AA7", "#73A4CA"]
ax.pie(gender_sizes, labels=gender_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Number of Loans taken based on gender ', fontsize=16)
# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])
# Show the plot
plt.show()
fig,(ax2,ax3)=plt.subplots(1,2, figsize=(10, 5))
colors = ['#FEB5D4', "#E7C5FF"]
# Second pie chart - loan repaid by gender
gender_loan_repaid = datasets["application_train"][datasets["application_train"]['TARGET'] == 0]
gender_loan_repaid_counts = gender_loan_repaid['CODE_GENDER'].value_counts()
gender_loan_repaid_labels = gender_loan_repaid_counts.index
gender_loan_repaid_sizes = gender_loan_repaid_counts.values
ax2.pie(gender_loan_repaid_sizes, labels=gender_loan_repaid_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
ax2.set_title('Loan Repaid by Gender')
# Third pie chart - loan not repaid by gender
gender_loan_not_repaid = datasets["application_train"][datasets["application_train"]['TARGET'] == 1]
gender_loan_not_repaid_counts = gender_loan_not_repaid['CODE_GENDER'].value_counts()
gender_loan_not_repaid_labels = gender_loan_not_repaid_counts.index
gender_loan_not_repaid_sizes = gender_loan_not_repaid_counts.values
ax3.pie(gender_loan_not_repaid_sizes, labels=gender_loan_not_repaid_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
ax3.set_title('Loan Not Repaid by Gender')
# Adjust the layout and spacing of the subplots
plt.tight_layout(pad=2)
# Show the plot
plt.show()
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25, color = "#FFD66A")
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
plt.figure(figsize = (12, 6))
sns.kdeplot(datasets["application_train"].loc[datasets["application_train"]['TARGET']==0, 'DAYS_BIRTH'] / -365, color='purple')
sns.kdeplot(datasets["application_train"].loc[datasets["application_train"]['TARGET']==1, 'DAYS_BIRTH'] / -365, color='green')
plt.xlabel('Age of applicant')
plt.title('Repayment of loan based on age', fontsize=16)
plt.show()
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))
# Create a countplot of NAME_HOUSING_TYPE vs TARGET
sns.countplot(x='NAME_HOUSING_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'crest')
axs.set_title('Applicants Housing Situation')
axs.set_ylabel('Count')
axs.set_xlabel('Housing Type')
plt.show()
# Create a pair of subplots with a shared y-axis
fig, axs = plt.subplots(ncols=2, figsize=(14, 6))
# Create a countplot of FLAG_OWN_CAR vs TARGET
sns.countplot(x='FLAG_OWN_CAR', hue='TARGET', data=datasets['application_train'], ax=axs[0],palette='Set1')
axs[0].set_title('Loan Repaid Based on Car Ownership')
# Create a countplot of FLAG_OWN_REALTY vs TARGET
sns.countplot(x='FLAG_OWN_REALTY', hue='TARGET', data=datasets['application_train'], ax=axs[1],palette='Set1')
axs[1].set_title('Loan Repaid Based on Realty Ownership')
# Set the y-axis label
axs[0].set_ylabel('Count')
axs[1].set_ylabel('Count')
# Set the x-axis label
axs[0].set_xlabel('Car Ownership')
axs[1].set_xlabel('Realty Ownership')
# Show the plot
plt.show()
In the two graphs above we can see that whether applicants own a car or real estate makes little difference to whether they repay their loan.
# First pie chart - loan data by type
fig, ax = plt.subplots(figsize=(6, 6))
loantype_counts = datasets["application_train"]['NAME_CONTRACT_TYPE'].value_counts()
loantype_labels = loantype_counts.index
loantype_sizes = loantype_counts.values
colors = ['#A7E8CB', '#C4DDFF']
ax.pie(loantype_sizes, labels=loantype_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Number of Loans taken based on loan type', fontsize=16)
# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])
# Show the plot
plt.show()
Almost 90% of the loans taken out by applicants are cash loans; the remainder are revolving loans. This shows a clear preference for cash loans.
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))
# Create a countplot of NAME_EDUCATION_TYPE vs TARGET
sns.countplot(x='NAME_EDUCATION_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'rocket')
axs.set_title('Education level of the Applicant')
# Set the y-axis label
axs.set_ylabel('Count')
axs.set_xlabel('Education Level')
# Show the plot
plt.show()
We can see that most applicants have completed at least secondary education, and a good number have completed higher education as well.
fig, axs = plt.subplots(ncols=1, figsize=(20, 24))
# Create a countplot of ORGANIZATION_TYPE vs TARGET
sns.countplot(y='ORGANIZATION_TYPE', hue='TARGET', data=datasets['application_train'], palette = 'rocket')
axs.set_title('Organisation that the Applicant is working in')
# Set the x-axis label
axs.set_xlabel('Count')
axs.set_ylabel('Organisation')
# Show the plot
plt.show()
We can observe that most applicants work in a business entity of some sort. However, there is an anomaly: the second-largest group of applicants falls under a null-like, unlabelled organization type, which we will need to handle in the next phase of the project.
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
Most people applying for a loan work as laborers or sales staff, which seems plausible given the real-world scenario, so no changes are needed to this column.
plt.figure(figsize=(12, 6))
sns.boxplot(x=datasets['application_train']['AMT_CREDIT'], palette='Set2')
plt.xlabel('Credit')
plt.title('Boxplot for Credit')
plt.show()
Credit is a very important feature for any applicant. The distribution is right-skewed and shows many outliers; because there are so many, removing them could discard valuable information.
plt.figure(figsize=(12, 6))
sns.histplot(datasets['application_train']['AMT_ANNUITY'], kde=True, color='purple')
plt.xlabel('Annuity')
plt.title('Distribution plot for Annuity')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(x=datasets['application_train']['AMT_GOODS_PRICE'], palette='Set2')
plt.xlabel('Goods Price')
plt.title('Boxplot for Goods Price')
plt.show()
This feature is also right-skewed, and we can observe many outliers in this graph as well. Since there are so many, removing them could likewise lose valuable information.
datasets["bureau"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB
datasets["bureau"].describe(include='all')
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1716428 | 1716428 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1716428 | 1.716428e+06 | 4.896370e+05 |
| unique | NaN | NaN | 4 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 | NaN | NaN |
| top | NaN | NaN | Closed | currency 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Consumer credit | NaN | NaN |
| freq | NaN | NaN | 1079273 | 1715020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1251615 | NaN | NaN |
| mean | 2.782149e+05 | 5.924434e+06 | NaN | NaN | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | NaN | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | NaN | NaN | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | NaN | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | NaN | NaN | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | NaN | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | NaN | NaN | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | NaN | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | NaN | NaN | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | NaN | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | NaN | NaN | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | NaN | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | NaN | NaN | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | NaN | 3.720000e+02 | 1.184534e+08 |
percent = (datasets["bureau"].isnull().sum()/datasets["bureau"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["bureau"].isnull().count()
total_count_without = datasets["bureau"].count()
missing_bureau_data = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_bureau_data
| Percent | Missing Count | Total Count without Missing value | Total Count | |
|---|---|---|---|---|
| AMT_ANNUITY | 71.47 | 1226791 | 489637 | 1716428 |
| AMT_CREDIT_MAX_OVERDUE | 65.51 | 1124488 | 591940 | 1716428 |
| DAYS_ENDDATE_FACT | 36.92 | 633653 | 1082775 | 1716428 |
| AMT_CREDIT_SUM_LIMIT | 34.48 | 591780 | 1124648 | 1716428 |
| AMT_CREDIT_SUM_DEBT | 15.01 | 257669 | 1458759 | 1716428 |
| DAYS_CREDIT_ENDDATE | 6.15 | 105553 | 1610875 | 1716428 |
| AMT_CREDIT_SUM | 0.00 | 13 | 1716415 | 1716428 |
| CREDIT_ACTIVE | 0.00 | 0 | 1716428 | 1716428 |
| CREDIT_CURRENCY | 0.00 | 0 | 1716428 | 1716428 |
| DAYS_CREDIT | 0.00 | 0 | 1716428 | 1716428 |
| CREDIT_DAY_OVERDUE | 0.00 | 0 | 1716428 | 1716428 |
| SK_ID_BUREAU | 0.00 | 0 | 1716428 | 1716428 |
| CNT_CREDIT_PROLONG | 0.00 | 0 | 1716428 | 1716428 |
| AMT_CREDIT_SUM_OVERDUE | 0.00 | 0 | 1716428 | 1716428 |
| CREDIT_TYPE | 0.00 | 0 | 1716428 | 1716428 |
| DAYS_CREDIT_UPDATE | 0.00 | 0 | 1716428 | 1716428 |
| SK_ID_CURR | 0.00 | 0 | 1716428 | 1716428 |
percent = (datasets['bureau'].isnull().sum()/datasets['bureau'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['bureau'].isna().sum().sort_values(ascending = False)
missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data=missing_data[missing_data['Percent'] > 0]
fig, ax = plt.subplots()
plt.title('Missing data for bureau (%)', fontsize=11)
sns.barplot(x=missing_data["Percent"], y=missing_data.index, alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
Text(0, 0.5, 'Features')
bureau_df = datasets['bureau']
application_train_df = datasets['application_train']
merged_df = application_train_df[['SK_ID_CURR', 'TARGET']].merge(bureau_df, on='SK_ID_CURR', how='left')
merged_df.head(5)
| SK_ID_CURR | TARGET | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 6158904.0 | Closed | currency 1 | -1125.0 | 0.0 | -1038.0 | -1038.0 | NaN | 0.0 | 40761.0 | NaN | NaN | 0.0 | Credit card | -1038.0 | 0.0 |
| 1 | 100002 | 1 | 6158905.0 | Closed | currency 1 | -476.0 | 0.0 | NaN | -48.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | Credit card | -47.0 | NaN |
| 2 | 100002 | 1 | 6158906.0 | Closed | currency 1 | -1437.0 | 0.0 | -1072.0 | -1185.0 | 0.000 | 0.0 | 135000.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -1185.0 | 0.0 |
| 3 | 100002 | 1 | 6158907.0 | Closed | currency 1 | -1121.0 | 0.0 | -911.0 | -911.0 | 3321.000 | 0.0 | 19071.0 | NaN | NaN | 0.0 | Consumer credit | -906.0 | 0.0 |
| 4 | 100002 | 1 | 6158908.0 | Closed | currency 1 | -645.0 | 0.0 | 85.0 | -36.0 | 5043.645 | 0.0 | 120735.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -34.0 | 0.0 |
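The head above shows the key property of this left join: one applicant (`SK_ID_CURR` 100002) fans out to several bureau rows, one per prior credit. A miniature example of the same `merge` semantics on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical miniature versions of application_train and bureau
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 3],
                       "AMT_CREDIT_SUM": [100.0, 200.0, 300.0]})

merged = app.merge(bureau, on="SK_ID_CURR", how="left")
print(merged)
# Client 1 fans out to two rows; client 2 has no bureau record, so its
# bureau columns are NaN; bureau-only client 3 is dropped by the left join.
```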
fig = plt.figure(figsize=(15,10))
sns.heatmap(merged_df.corr(), annot=True)
<Axes: >
# Removing the outlier which was affecting our plot
merged_df = merged_df[merged_df["SK_ID_CURR"] != 104463]
plt.figure(figsize=(10, 8))
sns.scatterplot(x = "DAYS_CREDIT", y='DAYS_ENDDATE_FACT', hue='TARGET' ,data=merged_df)
plt.title('Days Since Credit Inquiry vs. Days Since Credit Ended W.R.T Loan Repayment Status')
plt.xlabel('Days Since Credit Inquiry')
plt.ylabel('Days Since Credit Ended')
plt.show()
datasets["bureau_balance"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB
datasets["bureau_balance"].describe(include='all')
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| count | 2.729992e+07 | 2.729992e+07 | 27299925 |
| unique | NaN | NaN | 8 |
| top | NaN | NaN | C |
| freq | NaN | NaN | 13646993 |
| mean | 6.036297e+06 | -3.074169e+01 | NaN |
| std | 4.923489e+05 | 2.386451e+01 | NaN |
| min | 5.001709e+06 | -9.600000e+01 | NaN |
| 25% | 5.730933e+06 | -4.600000e+01 | NaN |
| 50% | 6.070821e+06 | -2.500000e+01 | NaN |
| 75% | 6.431951e+06 | -1.100000e+01 | NaN |
| max | 6.842888e+06 | 0.000000e+00 | NaN |
percent = (datasets["bureau_balance"].isnull().sum()/datasets["bureau_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau_balance"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["bureau_balance"].isnull().count()
total_count_without = datasets["bureau_balance"].count()
missing_bureau_balance_data = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_bureau_balance_data
| Percent | Missing Count | Total Count without Missing value | Total Count | |
|---|---|---|---|---|
| SK_ID_BUREAU | 0.0 | 0 | 27299925 | 27299925 |
| MONTHS_BALANCE | 0.0 | 0 | 27299925 | 27299925 |
| STATUS | 0.0 | 0 | 27299925 | 27299925 |
bureau_balance_df = datasets['bureau_balance']
bureau_balance_merged_df = merged_df[['SK_ID_BUREAU', 'TARGET']].merge(bureau_balance_df, on='SK_ID_BUREAU', how='left')
bureau_balance_merged_df.head(5)
| SK_ID_BUREAU | TARGET | MONTHS_BALANCE | STATUS | |
|---|---|---|---|---|
| 0 | 6158904.0 | 1 | -15.0 | C |
| 1 | 6158904.0 | 1 | -16.0 | C |
| 2 | 6158904.0 | 1 | -17.0 | 0 |
| 3 | 6158904.0 | 1 | -18.0 | 0 |
| 4 | 6158904.0 | 1 | -19.0 | 0 |
fig = plt.figure(figsize=(5,3))
sns.heatmap(bureau_balance_merged_df.corr(), annot=True)
<Axes: >
datasets["credit_card_balance"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB
datasets["credit_card_balance"].describe(include='all')
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3.072324e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.840312e+06 | 3.090496e+06 | 3.840312e+06 | 3.090496e+06 | 3.090496e+06 | 3.535076e+06 | 3840312 | 3.840312e+06 | 3.840312e+06 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7 | NaN | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Active | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3698436 | NaN | NaN |
| mean | 1.904504e+06 | 2.783242e+05 | -3.452192e+01 | 5.830016e+04 | 1.538080e+05 | 5.961325e+03 | 7.433388e+03 | 2.881696e+02 | 2.968805e+03 | 3.540204e+03 | 1.028054e+04 | 7.588857e+03 | 5.596588e+04 | 5.808881e+04 | 5.809829e+04 | 3.094490e-01 | 7.031439e-01 | 4.812496e-03 | 5.594791e-01 | 2.082508e+01 | NaN | 9.283667e+00 | 3.316220e-01 |
| std | 5.364695e+05 | 1.027045e+05 | 2.666775e+01 | 1.063070e+05 | 1.651457e+05 | 2.822569e+04 | 3.384608e+04 | 8.201989e+03 | 2.079689e+04 | 5.600154e+03 | 3.607808e+04 | 3.200599e+04 | 1.025336e+05 | 1.059654e+05 | 1.059718e+05 | 1.100401e+00 | 3.190347e+00 | 8.263861e-02 | 3.240649e+00 | 2.005149e+01 | NaN | 9.751570e+01 | 2.147923e+01 |
| min | 1.000018e+06 | 1.000060e+05 | -9.600000e+01 | -4.202502e+05 | 0.000000e+00 | -6.827310e+03 | -6.211620e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.233058e+05 | -4.202502e+05 | -4.202502e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | NaN | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434385e+06 | 1.895170e+05 | -5.500000e+01 | 0.000000e+00 | 4.500000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.523700e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 | NaN | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.897122e+06 | 2.783960e+05 | -2.800000e+01 | 0.000000e+00 | 1.125000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.702700e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | NaN | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.369328e+06 | 3.675800e+05 | -1.100000e+01 | 8.904669e+04 | 1.800000e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.633911e+03 | 9.000000e+03 | 6.750000e+03 | 8.535924e+04 | 8.889949e+04 | 8.891451e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.200000e+01 | NaN | 0.000000e+00 | 0.000000e+00 |
| max | 2.843496e+06 | 4.562500e+05 | -1.000000e+00 | 1.505902e+06 | 1.350000e+06 | 2.115000e+06 | 2.287098e+06 | 1.529847e+06 | 2.239274e+06 | 2.028820e+05 | 4.289207e+06 | 4.278316e+06 | 1.472317e+06 | 1.493338e+06 | 1.493338e+06 | 5.100000e+01 | 1.650000e+02 | 1.200000e+01 | 1.650000e+02 | 1.200000e+02 | NaN | 3.260000e+03 | 3.260000e+03 |
percent = (datasets["credit_card_balance"].isnull().sum()/datasets["credit_card_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["credit_card_balance"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["credit_card_balance"].isnull().count()
total_count_without = datasets["credit_card_balance"].count()
missing_credit_card_data = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_credit_card_data
| Percent | Missing Count | Total Count without Missing value | Total Count | |
|---|---|---|---|---|
| AMT_PAYMENT_CURRENT | 20.00 | 767988 | 3072324 | 3840312 |
| AMT_DRAWINGS_ATM_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| CNT_DRAWINGS_POS_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| AMT_DRAWINGS_OTHER_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| AMT_DRAWINGS_POS_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| CNT_DRAWINGS_OTHER_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| CNT_DRAWINGS_ATM_CURRENT | 19.52 | 749816 | 3090496 | 3840312 |
| CNT_INSTALMENT_MATURE_CUM | 7.95 | 305236 | 3535076 | 3840312 |
| AMT_INST_MIN_REGULARITY | 7.95 | 305236 | 3535076 | 3840312 |
| SK_ID_PREV | 0.00 | 0 | 3840312 | 3840312 |
| AMT_TOTAL_RECEIVABLE | 0.00 | 0 | 3840312 | 3840312 |
| SK_DPD | 0.00 | 0 | 3840312 | 3840312 |
| NAME_CONTRACT_STATUS | 0.00 | 0 | 3840312 | 3840312 |
| CNT_DRAWINGS_CURRENT | 0.00 | 0 | 3840312 | 3840312 |
| AMT_PAYMENT_TOTAL_CURRENT | 0.00 | 0 | 3840312 | 3840312 |
| AMT_RECIVABLE | 0.00 | 0 | 3840312 | 3840312 |
| AMT_RECEIVABLE_PRINCIPAL | 0.00 | 0 | 3840312 | 3840312 |
| SK_ID_CURR | 0.00 | 0 | 3840312 | 3840312 |
| AMT_DRAWINGS_CURRENT | 0.00 | 0 | 3840312 | 3840312 |
| AMT_CREDIT_LIMIT_ACTUAL | 0.00 | 0 | 3840312 | 3840312 |
| AMT_BALANCE | 0.00 | 0 | 3840312 | 3840312 |
| MONTHS_BALANCE | 0.00 | 0 | 3840312 | 3840312 |
| SK_DPD_DEF | 0.00 | 0 | 3840312 | 3840312 |
percent = (datasets['credit_card_balance'].isnull().sum()/datasets['credit_card_balance'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['credit_card_balance'].isna().sum().sort_values(ascending = False)
missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data = missing_data[missing_data['Percent'] > 0]
plt.title('Missing data for credit_card_balance (%)', fontsize=10)
sns.barplot(x = missing_data["Percent"],y = missing_data.index ,alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
plt.show()
fig, ax = plt.subplots()
ax.hist(datasets['credit_card_balance']['AMT_PAYMENT_TOTAL_CURRENT'], bins=50)
ax.set_title('Distribution of AMT_PAYMENT_TOTAL_CURRENT')
plt.xlabel("Total Amount in the corresponding month")
plt.ylabel("Count")
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.ticklabel_format(useOffset=False, style='plain')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)
fig, ax = plt.subplots()
sns.histplot(data=datasets['credit_card_balance'], x='AMT_CREDIT_LIMIT_ACTUAL', kde=True)
plt.title('Distribution of Credit Card Limits')
plt.xlabel('Credit Card Limit')
plt.ylabel('Count')
ax.ticklabel_format(useOffset=False, style='plain')
plt.show()
fig, ax = plt.subplots()
datasets['credit_card_balance']["credit_limit"] = datasets['credit_card_balance']["AMT_CREDIT_LIMIT_ACTUAL"] >= datasets['credit_card_balance']["AMT_BALANCE"]
datasets['credit_card_balance']["credit_limit"] = datasets['credit_card_balance']["credit_limit"].apply(lambda x: "Less spent" if x == True else "More spent")
sns.scatterplot(data=datasets['credit_card_balance'], x='AMT_CREDIT_LIMIT_ACTUAL', y='AMT_DRAWINGS_CURRENT', hue = "credit_limit")
plt.title('Credit Card Limit vs. Amount Spent During the Month')
plt.xlabel('Credit Card Limit')
plt.ylabel('Amount Spent During the Month')
ax.ticklabel_format(useOffset=False, style='plain')
plt.show()
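The boolean-comparison-plus-`apply` pattern used above to build the `credit_limit` label can be written as a single vectorized `np.where`, which avoids the per-row lambda. A sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical balances vs. limits: first client under limit, second over
cc = pd.DataFrame({"AMT_CREDIT_LIMIT_ACTUAL": [1000, 500],
                   "AMT_BALANCE": [800, 700]})

# Vectorized alternative to the boolean column + apply(lambda ...) pattern
cc["credit_limit"] = np.where(
    cc["AMT_CREDIT_LIMIT_ACTUAL"] >= cc["AMT_BALANCE"], "Less spent", "More spent"
)
print(cc["credit_limit"].tolist())
```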
merged_df = datasets['application_train'][['SK_ID_CURR', 'TARGET']].merge(datasets['credit_card_balance'], on='SK_ID_CURR', how='left')
merged_df.head(5)
| SK_ID_CURR | TARGET | SK_ID_PREV | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100003 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100004 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100006 | 0 | 1489396.0 | -2.0 | 0.0 | 270000.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | Active | 0.0 | 0.0 |
| 4 | 100006 | 0 | 1489396.0 | -1.0 | 0.0 | 270000.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | Active | 0.0 | 0.0 |
merged_df.corr()['TARGET'].sort_values(ascending= False)
TARGET                        1.000000
AMT_BALANCE                   0.050098
AMT_TOTAL_RECEIVABLE          0.049839
AMT_RECIVABLE                 0.049803
AMT_RECEIVABLE_PRINCIPAL      0.049692
AMT_INST_MIN_REGULARITY       0.039798
CNT_DRAWINGS_ATM_CURRENT      0.038437
CNT_DRAWINGS_CURRENT          0.037793
MONTHS_BALANCE                0.035695
CNT_DRAWINGS_POS_CURRENT      0.029536
AMT_DRAWINGS_ATM_CURRENT      0.024700
AMT_DRAWINGS_CURRENT          0.022378
AMT_CREDIT_LIMIT_ACTUAL       0.013823
AMT_PAYMENT_CURRENT           0.012929
AMT_PAYMENT_TOTAL_CURRENT     0.012302
SK_DPD_DEF                    0.010538
AMT_DRAWINGS_POS_CURRENT      0.005084
AMT_DRAWINGS_OTHER_CURRENT    0.003843
CNT_DRAWINGS_OTHER_CURRENT    0.003044
SK_ID_PREV                    0.002571
SK_DPD                        0.001684
SK_ID_CURR                   -0.004412
CNT_INSTALMENT_MATURE_CUM    -0.023684
Name: TARGET, dtype: float64
fig = plt.figure(figsize=(20,15))
sns.heatmap(merged_df.corr(), annot=True)
The correlation matrix shows only weak positive correlations between the TARGET column and the credit card features in the merged dataset.
AMT_BALANCE, AMT_TOTAL_RECEIVABLE, AMT_RECIVABLE, and AMT_RECEIVABLE_PRINCIPAL correlate most strongly with TARGET, indicating that borrowers with higher credit card balances and receivables are more likely to default on their loans.
Features such as AMT_INST_MIN_REGULARITY, CNT_DRAWINGS_ATM_CURRENT, and CNT_DRAWINGS_CURRENT are also weakly correlated with TARGET: borrowers who make smaller minimum payments, more ATM withdrawals, and more credit card drawings overall may be more likely to struggle with repayment. Based on these correlations, credit card balances and receivables look like useful candidate features for predicting loan defaults.
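One way to act on these weak correlations is to shortlist candidate features by an absolute-correlation cutoff. A minimal sketch on synthetic data (the 0.03 threshold and the NOISE column are purely illustrative, not a recommendation):

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, 10_000)
df = pd.DataFrame({
    'TARGET': target,
    'AMT_BALANCE': 0.5 * target + rng.normal(size=10_000),
    'NOISE': rng.normal(size=10_000),
})

# Keep features whose absolute correlation with TARGET exceeds the cutoff.
corr = df.corr()['TARGET'].drop('TARGET')
selected = corr[corr.abs() > 0.03].index.tolist()
print(selected)
```

On the real merged frame the same two lines would run against `merged_df.corr()['TARGET']`.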
edaPAdf = pd.DataFrame()
edaPAdf['SK_ID_CURR'] = datasets['application_train']['SK_ID_CURR']
edaPAdf['TARGET'] = datasets['application_train']['TARGET']
datasets['previous_application'].shape
(1670214, 37)
datasets['previous_application'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype
---  ------                       --------------    -----
 0   SK_ID_PREV                   1670214 non-null  int64
 1   SK_ID_CURR                   1670214 non-null  int64
 2   NAME_CONTRACT_TYPE           1670214 non-null  object
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object
 16  NAME_CONTRACT_STATUS         1670214 non-null  object
 17  DAYS_DECISION                1670214 non-null  int64
 18  NAME_PAYMENT_TYPE            1670214 non-null  object
 19  CODE_REJECT_REASON           1670214 non-null  object
 20  NAME_TYPE_SUITE              849809 non-null   object
 21  NAME_CLIENT_TYPE             1670214 non-null  object
 22  NAME_GOODS_CATEGORY          1670214 non-null  object
 23  NAME_PORTFOLIO               1670214 non-null  object
 24  NAME_PRODUCT_TYPE            1670214 non-null  object
 25  CHANNEL_TYPE                 1670214 non-null  object
 26  SELLERPLACE_AREA             1670214 non-null  int64
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object
 30  PRODUCT_COMBINATION          1669868 non-null  object
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
datasets['previous_application'].describe()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1.670214e+06 | 1.670214e+06 | 774370.000000 | 5951.000000 | 5951.000000 | 1.670214e+06 | 1.670214e+06 | 1.297984e+06 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| mean | 1.923089e+06 | 2.783572e+05 | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | 1.248418e+01 | 9.964675e-01 | 0.079637 | 0.188357 | 0.773503 | -8.806797e+02 | 3.139511e+02 | 1.605408e+01 | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | 3.334028e+00 | 5.932963e-02 | 0.107823 | 0.087671 | 0.100879 | 7.790997e+02 | 7.127443e+03 | 1.456729e+01 | 88916.115833 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -0.000015 | 0.034781 | 0.373150 | -2.922000e+03 | -1.000000e+00 | 0.000000e+00 | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | 1.000000e+01 | 1.000000e+00 | 0.000000 | 0.160716 | 0.715645 | -1.300000e+03 | -1.000000e+00 | 6.000000e+00 | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | 1.200000e+01 | 1.000000e+00 | 0.051605 | 0.189122 | 0.835095 | -5.810000e+02 | 3.000000e+00 | 1.200000e+01 | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | 1.500000e+01 | 1.000000e+00 | 0.108909 | 0.193330 | 0.852537 | -2.800000e+02 | 8.200000e+01 | 2.400000e+01 | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | 2.300000e+01 | 1.000000e+00 | 1.000000 | 1.000000 | 1.000000 | -1.000000e+00 | 4.000000e+06 | 8.400000e+01 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
datasets['previous_application'].describe(include='all')
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.670214e+06 | 1.670214e+06 | 1670214 | 1.297979e+06 | 1.670214e+06 | 1.670213e+06 | 7.743700e+05 | 1.284699e+06 | 1670214 | 1.670214e+06 | 1670214 | 1.670214e+06 | 774370.000000 | 5951.000000 | 5951.000000 | 1670214 | 1670214 | 1.670214e+06 | 1670214 | 1670214 | 849809 | 1670214 | 1670214 | 1670214 | 1670214 | 1670214 | 1.670214e+06 | 1670214 | 1.297984e+06 | 1670214 | 1669868 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 | 997149.000000 |
| unique | NaN | NaN | 4 | NaN | NaN | NaN | NaN | NaN | 7 | NaN | 2 | NaN | NaN | NaN | NaN | 25 | 4 | NaN | 4 | 9 | 7 | 4 | 28 | 5 | 3 | 8 | NaN | 11 | NaN | 5 | 17 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | NaN | NaN | NaN | NaN | NaN | TUESDAY | NaN | Y | NaN | NaN | NaN | NaN | XAP | Approved | NaN | Cash through the bank | XAP | Unaccompanied | Repeater | XNA | POS | XNA | Credit and cash offices | NaN | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 747553 | NaN | NaN | NaN | NaN | NaN | 255118 | NaN | 1661739 | NaN | NaN | NaN | NaN | 922661 | 1036781 | NaN | 1033552 | 1353093 | 508970 | 1231261 | 950809 | 691011 | 1063666 | 719968 | NaN | 855720 | NaN | 517215 | 285990 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 1.923089e+06 | 2.783572e+05 | NaN | 1.595512e+04 | 1.752339e+05 | 1.961140e+05 | 6.697402e+03 | 2.278473e+05 | NaN | 1.248418e+01 | NaN | 9.964675e-01 | 0.079637 | 0.188357 | 0.773503 | NaN | NaN | -8.806797e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.139511e+02 | NaN | 1.605408e+01 | NaN | NaN | 342209.855039 | 13826.269337 | 33767.774054 | 76582.403064 | 81992.343838 | 0.332570 |
| std | 5.325980e+05 | 1.028148e+05 | NaN | 1.478214e+04 | 2.927798e+05 | 3.185746e+05 | 2.092150e+04 | 3.153966e+05 | NaN | 3.334028e+00 | NaN | 5.932963e-02 | 0.107823 | 0.087671 | 0.100879 | NaN | NaN | 7.790997e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.127443e+03 | NaN | 1.456729e+01 | NaN | NaN | 88916.115833 | 72444.869708 | 106857.034789 | 149647.415123 | 153303.516729 | 0.471134 |
| min | 1.000001e+06 | 1.000010e+05 | NaN | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.000000e-01 | 0.000000e+00 | NaN | 0.000000e+00 | NaN | 0.000000e+00 | -0.000015 | 0.034781 | 0.373150 | NaN | NaN | -2.922000e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.000000e+00 | NaN | 0.000000e+00 | NaN | NaN | -2922.000000 | -2892.000000 | -2801.000000 | -2889.000000 | -2874.000000 | 0.000000 |
| 25% | 1.461857e+06 | 1.893290e+05 | NaN | 6.321780e+03 | 1.872000e+04 | 2.416050e+04 | 0.000000e+00 | 5.084100e+04 | NaN | 1.000000e+01 | NaN | 1.000000e+00 | 0.000000 | 0.160716 | 0.715645 | NaN | NaN | -1.300000e+03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.000000e+00 | NaN | 6.000000e+00 | NaN | NaN | 365243.000000 | -1628.000000 | -1242.000000 | -1314.000000 | -1270.000000 | 0.000000 |
| 50% | 1.923110e+06 | 2.787145e+05 | NaN | 1.125000e+04 | 7.104600e+04 | 8.054100e+04 | 1.638000e+03 | 1.123200e+05 | NaN | 1.200000e+01 | NaN | 1.000000e+00 | 0.051605 | 0.189122 | 0.835095 | NaN | NaN | -5.810000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.000000e+00 | NaN | 1.200000e+01 | NaN | NaN | 365243.000000 | -831.000000 | -361.000000 | -537.000000 | -499.000000 | 0.000000 |
| 75% | 2.384280e+06 | 3.675140e+05 | NaN | 2.065842e+04 | 1.803600e+05 | 2.164185e+05 | 7.740000e+03 | 2.340000e+05 | NaN | 1.500000e+01 | NaN | 1.000000e+00 | 0.108909 | 0.193330 | 0.852537 | NaN | NaN | -2.800000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.200000e+01 | NaN | 2.400000e+01 | NaN | NaN | 365243.000000 | -411.000000 | 129.000000 | -74.000000 | -44.000000 | 1.000000 |
| max | 2.845382e+06 | 4.562550e+05 | NaN | 4.180581e+05 | 6.905160e+06 | 6.905160e+06 | 3.060045e+06 | 6.905160e+06 | NaN | 2.300000e+01 | NaN | 1.000000e+00 | 1.000000 | 1.000000 | 1.000000 | NaN | NaN | -1.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.000000e+06 | NaN | 8.400000e+01 | NaN | NaN | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 365243.000000 | 1.000000 |
edaPAdf = edaPAdf.merge(datasets["previous_application"], how='left', on='SK_ID_CURR')
percent = (edaPAdf.isnull().sum()/edaPAdf.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = edaPAdf.isna().sum().sort_values(ascending = False)
missing_previous_application_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_previous_application_data.head(20)
| Percent | Missing Count | |
|---|---|---|
| RATE_INTEREST_PRIMARY | 99.67 | 1425364 |
| RATE_INTEREST_PRIVILEGED | 99.67 | 1425364 |
| AMT_DOWN_PAYMENT | 53.56 | 765994 |
| RATE_DOWN_PAYMENT | 53.56 | 765994 |
| NAME_TYPE_SUITE | 49.72 | 711126 |
| NFLAG_INSURED_ON_APPROVAL | 40.38 | 577560 |
| DAYS_TERMINATION | 40.38 | 577560 |
| DAYS_LAST_DUE | 40.38 | 577560 |
| DAYS_LAST_DUE_1ST_VERSION | 40.38 | 577560 |
| DAYS_FIRST_DUE | 40.38 | 577560 |
| DAYS_FIRST_DRAWING | 40.38 | 577560 |
| AMT_GOODS_PRICE | 23.49 | 335979 |
| AMT_ANNUITY | 22.63 | 323672 |
| CNT_PAYMENT | 22.63 | 323667 |
| PRODUCT_COMBINATION | 1.17 | 16767 |
| AMT_CREDIT | 1.15 | 16455 |
| NAME_YIELD_GROUP | 1.15 | 16454 |
| NAME_GOODS_CATEGORY | 1.15 | 16454 |
| NAME_SELLER_INDUSTRY | 1.15 | 16454 |
| SELLERPLACE_AREA | 1.15 | 16454 |
# Filter rows with count of missing values greater than 0
missing_previous_application_data = missing_previous_application_data[missing_previous_application_data['Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_previous_application_data))
missing_more_than_60=missing_previous_application_data[missing_previous_application_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))
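Since this missing-value summary is rebuilt for each auxiliary table, it could be factored into a small helper. The function name `missing_summary` is our own; the sketch below runs on a toy frame:

```python
import pandas as pd

def missing_summary(df):
    """Per-column missing counts and percentages, sorted descending."""
    counts = df.isna().sum().sort_values(ascending=False)
    percent = (counts / len(df) * 100).round(2)
    return pd.concat([percent, counts], axis=1, keys=['Percent', 'Missing Count'])

# Toy frame: column 'a' is half missing, column 'b' is complete.
demo = pd.DataFrame({'a': [1, None, 3, None], 'b': [1, 2, 3, 4]})
print(missing_summary(demo))
```

The same call would then replace the repeated `percent` / `sum_missing` / `pd.concat` blocks for each of the merged tables.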
plt.figure(figsize=(15, 20))
sns.set_style("whitegrid")
ax = sns.barplot(x="Missing Count", y=missing_previous_application_data.index, data=missing_previous_application_data,
palette="rocket")
# Set the figure size and axis labels
ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")
plt.title('Missing values in previous application columns')
# Add values to the bars
for i, v in enumerate(missing_previous_application_data['Missing Count']):
ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')
# Show the plot
plt.show()
Number of columns with missing value greater than 0: 36
Number of columns with more than 60% missing values: 2
Four columns are missing more than 50% of their values. Two of them are 99% missing, so keeping them would add almost no information. In the next phase we will decide whether to drop these columns or whether any alternative (such as imputation or missing-value indicator flags) is worth considering.
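One option for that next phase, sketched here on a toy frame, is to drop any column whose missing fraction exceeds a chosen threshold (the helper name `drop_high_missing` and the 60% cutoff are our own illustration, mirroring the check above):

```python
import pandas as pd

def drop_high_missing(df, threshold=0.6):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    keep = df.columns[df.isna().mean() <= threshold]
    return df[keep]

demo = pd.DataFrame({
    'mostly_missing': [None, None, None, 1],  # 75% missing -> dropped
    'mostly_present': [1, 2, None, 4],        # 25% missing -> kept
})
print(drop_high_missing(demo).columns.tolist())
```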
edaPAdf.corr()['TARGET']
SK_ID_CURR                  -0.001216
TARGET                       1.000000
SK_ID_PREV                   0.002009
AMT_ANNUITY                 -0.014922
AMT_APPLICATION             -0.005583
AMT_CREDIT                  -0.002350
AMT_DOWN_PAYMENT            -0.016918
AMT_GOODS_PRICE              0.000254
HOUR_APPR_PROCESS_START     -0.027809
NFLAG_LAST_APPL_IN_DAY      -0.002887
RATE_DOWN_PAYMENT           -0.026111
RATE_INTEREST_PRIMARY       -0.001470
RATE_INTEREST_PRIVILEGED     0.028640
DAYS_DECISION                0.039901
SELLERPLACE_AREA            -0.002539
CNT_PAYMENT                  0.030480
DAYS_FIRST_DRAWING          -0.031154
DAYS_FIRST_DUE              -0.006651
DAYS_LAST_DUE_1ST_VERSION    0.018021
DAYS_LAST_DUE                0.017522
DAYS_TERMINATION             0.016981
NFLAG_INSURED_ON_APPROVAL    0.000653
Name: TARGET, dtype: float64
correlations = edaPAdf.corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(correlations, cmap = cmap)
fig, ax = plt.subplots(figsize=(6, 6))
contract_type_counts = edaPAdf['NAME_CONTRACT_TYPE'].value_counts()
contract_type_labels = contract_type_counts.index
contract_type_sizes = contract_type_counts.values
colors = ["#73A4CA", "#497AA7", "#2E5B88"]
ax.pie(contract_type_sizes, labels=contract_type_labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 12}, colors=colors)
# Add a common title for the figure
ax.set_title('Contract type of the previous applications', fontsize=16)
# Set the aspect ratio to "equal" and the x and y limits to the same value to make it a perfect circle
ax.set_aspect('equal')
ax.set_xlim([-1.1, 1.1])
ax.set_ylim([-1.1, 1.1])
# Show the plot
plt.show()
sns.countplot(x='NAME_CONTRACT_STATUS', data=edaPAdf, palette = "Set3");
plt.title('Contract status of the previous applicant');
plt.xlabel('Contract Status')
plt.xticks(rotation=90);
From the graph we can deduce that at least 60% of the previous applications were approved.
fig, axs = plt.subplots(ncols=1, figsize=(12, 6))
sns.countplot(x='NAME_CLIENT_TYPE', hue='TARGET', data=edaPAdf, palette = 'rocket')
axs.set_title('Client old or new vs Target')
plt.xlabel('Client Type')
axs.set_ylabel('Count')
plt.show()
We can observe that there are relatively few new clients but many repeat clients, and the vast majority of these repeat clients fall into the TARGET = 0 class (no repayment difficulty).
plt.figure(figsize=(12, 6))
sns.histplot(edaPAdf['AMT_CREDIT'], kde=True, color='maroon')
plt.xlabel('Credit')
plt.title('Distribution plot for Credit')
plt.show()
edaPCBdf = pd.DataFrame()
edaPCBdf['SK_ID_CURR'] = datasets['application_train']['SK_ID_CURR']
edaPCBdf['TARGET'] = datasets['application_train']['TARGET']
datasets['POS_CASH_balance'].shape
(10001358, 8)
datasets['POS_CASH_balance'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   SK_ID_PREV             int64
 1   SK_ID_CURR             int64
 2   MONTHS_BALANCE         int64
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object
 6   SK_DPD                 int64
 7   SK_DPD_DEF             int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
datasets['POS_CASH_balance'].describe()
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|
| count | 1.000136e+07 | 1.000136e+07 | 1.000136e+07 | 9.975287e+06 | 9.975271e+06 | 1.000136e+07 | 1.000136e+07 |
| mean | 1.903217e+06 | 2.784039e+05 | -3.501259e+01 | 1.708965e+01 | 1.048384e+01 | 1.160693e+01 | 6.544684e-01 |
| std | 5.358465e+05 | 1.027637e+05 | 2.606657e+01 | 1.199506e+01 | 1.110906e+01 | 1.327140e+02 | 3.276249e+01 |
| min | 1.000001e+06 | 1.000010e+05 | -9.600000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434405e+06 | 1.895500e+05 | -5.400000e+01 | 1.000000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.896565e+06 | 2.786540e+05 | -2.800000e+01 | 1.200000e+01 | 7.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.368963e+06 | 3.674290e+05 | -1.300000e+01 | 2.400000e+01 | 1.400000e+01 | 0.000000e+00 | 0.000000e+00 |
| max | 2.843499e+06 | 4.562550e+05 | -1.000000e+00 | 9.200000e+01 | 8.500000e+01 | 4.231000e+03 | 3.595000e+03 |
datasets['POS_CASH_balance'].describe(include='all')
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| count | 1.000136e+07 | 1.000136e+07 | 1.000136e+07 | 9.975287e+06 | 9.975271e+06 | 10001358 | 1.000136e+07 | 1.000136e+07 |
| unique | NaN | NaN | NaN | NaN | NaN | 9 | NaN | NaN |
| top | NaN | NaN | NaN | NaN | NaN | Active | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | 9151119 | NaN | NaN |
| mean | 1.903217e+06 | 2.784039e+05 | -3.501259e+01 | 1.708965e+01 | 1.048384e+01 | NaN | 1.160693e+01 | 6.544684e-01 |
| std | 5.358465e+05 | 1.027637e+05 | 2.606657e+01 | 1.199506e+01 | 1.110906e+01 | NaN | 1.327140e+02 | 3.276249e+01 |
| min | 1.000001e+06 | 1.000010e+05 | -9.600000e+01 | 1.000000e+00 | 0.000000e+00 | NaN | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434405e+06 | 1.895500e+05 | -5.400000e+01 | 1.000000e+01 | 3.000000e+00 | NaN | 0.000000e+00 | 0.000000e+00 |
| 50% | 1.896565e+06 | 2.786540e+05 | -2.800000e+01 | 1.200000e+01 | 7.000000e+00 | NaN | 0.000000e+00 | 0.000000e+00 |
| 75% | 2.368963e+06 | 3.674290e+05 | -1.300000e+01 | 2.400000e+01 | 1.400000e+01 | NaN | 0.000000e+00 | 0.000000e+00 |
| max | 2.843499e+06 | 4.562550e+05 | -1.000000e+00 | 9.200000e+01 | 8.500000e+01 | NaN | 4.231000e+03 | 3.595000e+03 |
edaPCBdf = edaPCBdf.merge(datasets["POS_CASH_balance"], how='left', on='SK_ID_CURR')
percent = (edaPCBdf.isnull().sum()/edaPCBdf.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = edaPCBdf.isna().sum().sort_values(ascending = False)
missing_POS_CASH_balance_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_POS_CASH_balance_data.head(20)
| Percent | Missing Count | |
|---|---|---|
| CNT_INSTALMENT_FUTURE | 0.47 | 39945 |
| CNT_INSTALMENT | 0.47 | 39930 |
| SK_ID_PREV | 0.21 | 18067 |
| MONTHS_BALANCE | 0.21 | 18067 |
| NAME_CONTRACT_STATUS | 0.21 | 18067 |
| SK_DPD | 0.21 | 18067 |
| SK_DPD_DEF | 0.21 | 18067 |
| SK_ID_CURR | 0.00 | 0 |
| TARGET | 0.00 | 0 |
# Filter rows with count of missing values greater than 0
missing_POS_CASH_balance_data = missing_POS_CASH_balance_data[missing_POS_CASH_balance_data['Missing Count'] > 0]
print("Number of columns with missing value greater than 0:",len(missing_POS_CASH_balance_data))
missing_more_than_60=missing_POS_CASH_balance_data[missing_POS_CASH_balance_data['Percent'] >=60]
print("Number of columns with more than 60% missing values:", len(missing_more_than_60))
plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
ax = sns.barplot(x="Missing Count", y=missing_POS_CASH_balance_data.index, data=missing_POS_CASH_balance_data,
palette="rocket")
# Set the figure size and axis labels
ax.set_xlabel("Number of Missing Values ")
ax.set_ylabel("Features")
plt.title('Missing values in POS CASH Balance columns')
# Add values to the bars
for i, v in enumerate(missing_POS_CASH_balance_data['Missing Count']):
ax.text(v + 10, i + .25, str(v), color='black', fontweight='bold')
# Show the plot
plt.show()
Number of columns with missing value greater than 0: 7
Number of columns with more than 60% missing values: 0
There are barely any missing values in this dataset.
edaPCBdf.corr()['TARGET']
SK_ID_CURR              -0.002244
TARGET                   1.000000
SK_ID_PREV              -0.000056
MONTHS_BALANCE           0.020147
CNT_INSTALMENT           0.018506
CNT_INSTALMENT_FUTURE    0.021972
SK_DPD                   0.009866
SK_DPD_DEF               0.008594
Name: TARGET, dtype: float64
correlations = edaPCBdf.corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(correlations, cmap = cmap)
fig, axs = plt.subplots(ncols=1, figsize=(10, 25))
sns.countplot(y='CNT_INSTALMENT_FUTURE', hue='TARGET', data=edaPCBdf, palette = 'Set1')
axs.set_title('Installments left to pay vs Target')
axs.set_xlabel('Count')
axs.set_ylabel('Installments left to pay')
plt.show()
Most clients have fewer than 10 installments left to pay, and this group also shows the highest repayment rate.
fig, axs = plt.subplots(ncols=1, figsize=(15, 7))
sns.countplot(x='NAME_CONTRACT_STATUS', data=edaPCBdf, palette = 'crest')
axs.set_title('Contract status during the month')
axs.set_ylabel('Count')
plt.show()
Almost 90% of the contracts are Active, around 9% are Completed, and roughly 1% are Signed (percentages are rough visual estimates from the plot, not computed values).
datasets["installments_payments"].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
 #   Column                  Dtype
---  ------                  -----
 0   SK_ID_PREV              int64
 1   SK_ID_CURR              int64
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
datasets["installments_payments"].describe(include='all')
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| count | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360540e+07 | 1.360250e+07 | 1.360540e+07 | 1.360250e+07 |
| mean | 1.903365e+06 | 2.784449e+05 | 8.566373e-01 | 1.887090e+01 | -1.042270e+03 | -1.051114e+03 | 1.705091e+04 | 1.723822e+04 |
| std | 5.362029e+05 | 1.027183e+05 | 1.035216e+00 | 2.666407e+01 | 8.009463e+02 | 8.005859e+02 | 5.057025e+04 | 5.473578e+04 |
| min | 1.000001e+06 | 1.000010e+05 | 0.000000e+00 | 1.000000e+00 | -2.922000e+03 | -4.921000e+03 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.434191e+06 | 1.896390e+05 | 0.000000e+00 | 4.000000e+00 | -1.654000e+03 | -1.662000e+03 | 4.226085e+03 | 3.398265e+03 |
| 50% | 1.896520e+06 | 2.786850e+05 | 1.000000e+00 | 8.000000e+00 | -8.180000e+02 | -8.270000e+02 | 8.884080e+03 | 8.125515e+03 |
| 75% | 2.369094e+06 | 3.675300e+05 | 1.000000e+00 | 1.900000e+01 | -3.610000e+02 | -3.700000e+02 | 1.671021e+04 | 1.610842e+04 |
| max | 2.843499e+06 | 4.562550e+05 | 1.780000e+02 | 2.770000e+02 | -1.000000e+00 | -1.000000e+00 | 3.771488e+06 | 3.771488e+06 |
percent = (datasets["installments_payments"].isnull().sum()/datasets["installments_payments"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["installments_payments"].isna().sum().sort_values(ascending = False)
total_count_with = datasets["installments_payments"].isnull().count()
total_count_without = datasets["installments_payments"].count()
missing_application_train_data = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
| Percent | Missing Count | Total Count without Missing value | Total Count | |
|---|---|---|---|---|
| DAYS_ENTRY_PAYMENT | 0.02 | 2905 | 13602496 | 13605401 |
| AMT_PAYMENT | 0.02 | 2905 | 13602496 | 13605401 |
| SK_ID_PREV | 0.00 | 0 | 13605401 | 13605401 |
| SK_ID_CURR | 0.00 | 0 | 13605401 | 13605401 |
| NUM_INSTALMENT_VERSION | 0.00 | 0 | 13605401 | 13605401 |
| NUM_INSTALMENT_NUMBER | 0.00 | 0 | 13605401 | 13605401 |
| DAYS_INSTALMENT | 0.00 | 0 | 13605401 | 13605401 |
| AMT_INSTALMENT | 0.00 | 0 | 13605401 | 13605401 |
percent = (datasets['installments_payments'].isnull().sum()/datasets['installments_payments'].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets['installments_payments'].isna().sum().sort_values(ascending = False)
missing_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_data=missing_data[missing_data['Percent'] > 0]
fig, ax = plt.subplots()
plt.title(f'Missing data for installments_payments in percent', fontsize=10)
sns.barplot(x = missing_data["Percent"],y = missing_data.index ,alpha=0.8)
plt.xlabel('Missing values (%)', fontsize=10)
plt.ylabel('Features', fontsize=10)
inst_pay = pd.DataFrame()
inst_pay['AMT_DIFF'] = datasets["installments_payments"]['AMT_INSTALMENT'] - datasets["installments_payments"]['AMT_PAYMENT']
# Flag each instalment as over-, partial-, or full payment from the sign of AMT_DIFF
inst_pay['OVER_PAYMENT'] = (inst_pay['AMT_DIFF'] < 0).astype(int)
inst_pay['PARTIAL_PAYMENT'] = (inst_pay['AMT_DIFF'] > 0).astype(int)
inst_pay['FULL_PAYMENT'] = (inst_pay['AMT_DIFF'] == 0).astype(int)
full_payment_pct = inst_pay['FULL_PAYMENT'].sum() / inst_pay.shape[0] * 100
partial_payment_pct = inst_pay['PARTIAL_PAYMENT'].sum() / inst_pay.shape[0] * 100
over_payment_pct = inst_pay['OVER_PAYMENT'].sum() / inst_pay.shape[0] * 100
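As a side note, the three indicator columns are fully determined by the sign of AMT_DIFF, so the same scenario labels can be derived in one step. A small sketch on toy values (the label mapping is our own naming, matching the categories above):

```python
import numpy as np
import pandas as pd

# Toy instalment-minus-payment differences: 0 = paid exactly, >0 = underpaid, <0 = overpaid.
amt_diff = pd.Series([0.0, 150.0, -50.0, 0.0])

labels = {0.0: 'Full Payment', 1.0: 'Partial Payment', -1.0: 'Over Payment'}
scenario = np.sign(amt_diff).map(labels)

# Percentage breakdown of the three scenarios in one call.
print(scenario.value_counts(normalize=True) * 100)
```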
labels = ['Full Payment', 'Partial Payment', 'Over Payment']
x = [full_payment_pct, partial_payment_pct, over_payment_pct]
colors = ['#FF671F','#046A38','#06038D']
explode = (0, 0.05, -0.02)
wedge_properties = {"edgecolor":"k",'width': .25}
plt.pie(x, labels=labels, explode=explode, colors=colors,
startangle=150, counterclock=False, shadow=True,
wedgeprops=wedge_properties, autopct="%1.1f%%")
plt.title("Distribution of Installment Payment Scenarios")
plt.show()
payment_counts = [inst_pay['FULL_PAYMENT'].sum(), inst_pay['PARTIAL_PAYMENT'].sum(), inst_pay['OVER_PAYMENT'].sum()]
payment_labels = ['Full Payment', 'Partial Payment', 'Over Payment']
columns = {
'Payment Scenarios': payment_labels,
'Count': payment_counts
}
df = pd.DataFrame(columns)
fig, ax = plt.subplots()
sns.barplot(x='Payment Scenarios', y='Count', data=df, ax=ax, palette='plasma')
ax.set_title('Installment Payment Scenarios')
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)
for index, row in df.iterrows():
ax.text(row.name, row.Count, round(row.Count), color='black', ha='center')
installment_versions = datasets["installments_payments"]["NUM_INSTALMENT_VERSION"].value_counts()
installment_versions = installment_versions[installment_versions > 1000]
columns = {
'Installment Versions': installment_versions.index.astype('int64'),
'Count': installment_versions.values
}
df = pd.DataFrame(columns)
fig, ax = plt.subplots()
sns.barplot(x = 'Installment Versions', y = 'Count', data = df)
ax.set_title('Count of Different Installment Versions')
plt.xticks(rotation=90)
ax.tick_params(labelsize=10,length=0)
plt.box(False)
plt.ticklabel_format(style='plain', axis='y')
ax.yaxis.grid(linewidth=0.5,color='black')
ax.set_axisbelow(True)
total_payments = datasets["installments_payments"]["AMT_PAYMENT"].sum()
payments_by_installment = datasets["installments_payments"].groupby("NUM_INSTALMENT_NUMBER")["AMT_PAYMENT"].sum()
percentages = payments_by_installment / total_payments * 100
percentages = percentages[percentages > 2.0]
plt.pie(percentages.values, labels=percentages.index, autopct='%1.1f%%')
plt.title("Percentage of Total Payments Made by Installment Number")
plt.show()
merged_df = datasets['application_train'][['SK_ID_CURR', 'TARGET']].merge(datasets['installments_payments'], on='SK_ID_CURR', how='left')
merged_df.head(5)
| SK_ID_CURR | TARGET | SK_ID_PREV | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 1038818.0 | 1.0 | 16.0 | -115.0 | -133.0 | 9251.775 | 9251.775 |
| 1 | 100002 | 1 | 1038818.0 | 1.0 | 8.0 | -355.0 | -375.0 | 9251.775 | 9251.775 |
| 2 | 100002 | 1 | 1038818.0 | 2.0 | 19.0 | -25.0 | -49.0 | 53093.745 | 53093.745 |
| 3 | 100002 | 1 | 1038818.0 | 1.0 | 12.0 | -235.0 | -250.0 | 9251.775 | 9251.775 |
| 4 | 100002 | 1 | 1038818.0 | 1.0 | 9.0 | -325.0 | -344.0 | 9251.775 | 9251.775 |
merged_df.corr()['TARGET'].abs().sort_values(ascending= False)
TARGET                    1.000000
DAYS_ENTRY_PAYMENT        0.035122
DAYS_INSTALMENT           0.034974
NUM_INSTALMENT_NUMBER     0.016190
NUM_INSTALMENT_VERSION    0.009896
AMT_PAYMENT               0.003623
SK_ID_CURR                0.002533
AMT_INSTALMENT            0.001498
SK_ID_PREV                0.000212
Name: TARGET, dtype: float64
fig = plt.figure(figsize=(12,10))
sns.heatmap(merged_df.corr(), annot=True)
list(datasets.keys())
['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance']
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
# is there an overlap between the test and train customers
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 118)
datasets["application_train"].shape
(307511, 119)
Most of the clients in the Kaggle submission (test) file appear in previous_application.csv: 47,800 out of 48,744 have had at least one previous application.
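This overlap count can be reproduced with a simple membership test. A sketch of the same check on toy IDs (the real count would use the SK_ID_CURR columns of application_test and previous_application):

```python
import pandas as pd

test_ids = pd.Series([100001, 100005, 100013, 100028])          # toy test-set clients
prev_ids = pd.Series([100001, 100005, 100028, 100042, 100001])  # toy previous applications

# Count test clients with at least one previous application.
n_with_history = test_ids.isin(prev_ids).sum()
print(f"{n_with_history} of {len(test_ids)} test clients have previous applications")
```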
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
1,670,214 rows, 37 columns
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 1,670,214 previous applications
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_train"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 291,057
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 47,800
# How many previous applications per applicant are in previous_application?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts > 40])  # more than 40 previous applications
plt.hist(prevAppCounts[prevAppCounts>=0], bins=100)
plt.grid()
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=90)
plt.show()
sum(appsDF['SK_ID_CURR'].value_counts()==1)
60458
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
* Low = fewer than 5 previous applications (~58%)
* Medium = 5 to 39 previous applications (~42%)
* High = 40 or more previous applications (~0.03%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all), 5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 5 or more previous apps: 41.76895 Percentage with 40 or more previous apps: 0.03453
In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets before feeding them to a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
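One common answer, sketched below on toy stand-in frames (all values hypothetical), is to aggregate each secondary table down to one row per SK_ID_CURR and then left-merge the result into the primary table:

```python
import pandas as pd

# Hypothetical toy stand-ins for application_train and previous_application
primary = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
secondary = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],               # client 3 has no previous applications
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})

# Aggregate the secondary table to one row per client, then flatten the columns
agg = secondary.groupby("SK_ID_CURR").agg({"AMT_CREDIT": ["min", "max", "mean"]})
agg.columns = ["_".join(c) for c in agg.columns]
agg = agg.reset_index()

# A left merge keeps every primary row; clients without history get NaN aggregates
merged = primary.merge(agg, how="left", on="SK_ID_CURR")
print(merged)
```

The left merge matters: an inner join would silently drop applicants who have no rows in the secondary table.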
Joining previous_application with application_x
We refer to the application_train data (and the application_test data) as the primary table, and the other files (e.g., the previous_application dataset) as secondary tables. The secondary tables can be joined to the primary table using the shared key SK_ID_CURR.
Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:
AMT_APPLICATION and AMT_CREDIT aggregates, based on the average, min, max, median, etc. To build such features, we need to join the application_train data (and the application_test data) with the 'previous_application' dataset (and the other available datasets).
When joining this data in the context of pipelines, different strategies come to mind, with various tradeoffs:

* Join the aggregated secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train/valid/test partition) via your machine learning pipeline. This approach is recommended for the HCDR competition: the aggregations are computed per SK_ID_CURR and use no label information, so merging before the split introduces no leakage, and the joins only have to be done once rather than once per partition.
* Join the aggregated secondary tables after partitioning the application_train data (the labeled dataset) and the application_test data (the unlabeled submission dataset), thereby leading to X_train, y_train, X_valid, etc.

appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]["AMT_CREDIT"]
6 0.0 Name: AMT_CREDIT, dtype: float64
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Canceled | -14 | XNA | XAP | NaN | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.
import pandas as pd
import dateutil
# Load data from a csv file (pd.DataFrame.from_csv was removed; use pd.read_csv)
data = pd.read_csv('phone_data.csv')
# Convert the date column from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
data.groupby('month', as_index=False).agg({"duration": "sum"})
Pandas reset_index() to convert Multi-Index to Columns
We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.
Since we have both the variable name and the operation performed in two rows in the Multi-Index dataframe, we can use that and name our new columns correctly.
For more details on unstacking groupby results, with examples, please see here
For more details and examples, please see here
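The flattening described above can be sketched on a toy frame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 2, 3]})

# Aggregating with several ops yields MultiIndex columns: (variable, operation)
out = df.groupby("g").agg({"x": ["min", "max"]})

# Join the two column levels into flat, descriptive names,
# then reset_index() to turn the grouping index back into a column
out.columns = ["_".join(col) for col in out.columns]
out = out.reset_index()
print(out.columns.tolist())  # ['g', 'x_min', 'x_max']
```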
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
print(f"{appsDF[features].describe()}")
agg_ops = ["min", "max", "mean"]
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).mean(numeric_only=True)  # group by ID; numeric columns only
display(result.head())
print("-"*50)
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
result.columns = result.columns.map('_'.join)
display(result)
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result[0:10]
AMT_ANNUITY AMT_APPLICATION count 1.297979e+06 1.670214e+06 mean 1.595512e+04 1.752339e+05 std 1.478214e+04 2.927798e+05 min 0.000000e+00 0.000000e+00 25% 6.321780e+03 1.872000e+04 50% 1.125000e+04 7.104600e+04 75% 2.065842e+04 1.803600e+05 max 4.180581e+05 6.905160e+06
| SK_ID_CURR | SK_ID_PREV | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1.369693e+06 | 3951.000 | 24835.50 | 23787.00 | 2520.0 | 24835.5 | 13.000000 | 1.0 | 0.104326 | NaN | NaN | -1740.0 | 23.0 | 8.0 | 365243.0 | -1709.000000 | -1499.000000 | -1619.000000 | -1612.000000 | 0.000000 |
| 1 | 100002 | 1.038818e+06 | 9251.775 | 179055.00 | 179055.00 | 0.0 | 179055.0 | 9.000000 | 1.0 | 0.000000 | NaN | NaN | -606.0 | 500.0 | 24.0 | 365243.0 | -565.000000 | 125.000000 | -25.000000 | -17.000000 | 0.000000 |
| 2 | 100003 | 2.281150e+06 | 56553.990 | 435436.50 | 484191.00 | 3442.5 | 435436.5 | 14.666667 | 1.0 | 0.050030 | NaN | NaN | -1305.0 | 533.0 | 10.0 | 365243.0 | -1274.333333 | -1004.333333 | -1054.333333 | -1047.333333 | 0.666667 |
| 3 | 100004 | 1.564014e+06 | 5357.250 | 24282.00 | 20106.00 | 4860.0 | 24282.0 | 5.000000 | 1.0 | 0.212008 | NaN | NaN | -815.0 | 30.0 | 4.0 | 365243.0 | -784.000000 | -694.000000 | -724.000000 | -714.000000 | 0.000000 |
| 4 | 100005 | 2.176837e+06 | 4813.200 | 22308.75 | 20076.75 | 4464.0 | 44617.5 | 10.500000 | 1.0 | 0.108964 | NaN | NaN | -536.0 | 18.0 | 12.0 | 365243.0 | -706.000000 | -376.000000 | -466.000000 | -460.000000 | 0.000000 |
--------------------------------------------------
| SK_ID_CURR_ | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | |
|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338852 | 456251 | 6605.910 | 6605.910 | 6605.910000 | 40455.0 | 40455.0 | 40455.000 |
| 338853 | 456252 | 10074.465 | 10074.465 | 10074.465000 | 57595.5 | 57595.5 | 57595.500 |
| 338854 | 456253 | 3973.095 | 5567.715 | 4770.405000 | 19413.0 | 28912.5 | 24162.750 |
| 338855 | 456254 | 2296.440 | 19065.825 | 10681.132500 | 18846.0 | 223789.5 | 121317.750 |
| 338856 | 456255 | 2250.000 | 54022.140 | 20775.391875 | 45000.0 | 1170000.0 | 362770.875 |
338857 rows × 7 columns
result.shape: (338857, 8)
| SK_ID_CURR_ | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | range_AMT_APPLICATION | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500000 | 0.0 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000000 | 0.0 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500000 | 831190.5 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000000 | 0.0 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750000 | 44617.5 |
| 5 | 100006 | 2482.920 | 39954.510 | 23651.175000 | 0.0 | 688500.0 | 272203.260000 | 688500.0 |
| 6 | 100007 | 1834.290 | 22678.785 | 12278.805000 | 17176.5 | 247500.0 | 150530.250000 | 230323.5 |
| 7 | 100008 | 8019.090 | 25309.575 | 15839.696250 | 0.0 | 450000.0 | 155701.800000 | 450000.0 |
| 8 | 100009 | 7435.845 | 17341.605 | 10051.412143 | 40455.0 | 110160.0 | 76741.714286 | 69705.0 |
| 9 | 100010 | 27463.410 | 27463.410 | 27463.410000 | 247212.0 | 247212.0 | 247212.000000 | 0.0 |
# Create aggregate features (via pipeline)
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):  # no *args or **kwargs
        self.features = features
        # For each selected feature, compute min, max, mean, and median per client
        self.agg_op_features = {f: ["min", "max", "mean", "median"] for f in features}

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        # Flatten the (feature, op) MultiIndex columns into feature_op names
        result.columns = ["_".join(x) for x in result.columns.ravel()]
        result = result.reset_index(level=["SK_ID_CURR"])
        return result  # dataframe keyed by "SK_ID_CURR", ready to merge
The features were selected for analysis based on their higher correlation with the TARGET column across datasets such as bureau, credit_card_balance, previous_application, POS_CASH_balance, and installments_payments. These features provide insights into the borrower's payment behavior, creditworthiness, payment patterns, and repayment risk, which can aid in making informed decisions regarding loan approvals, risk assessments, and borrower management.
We selected DAYS_CREDIT, DAYS_CREDIT_UPDATE, and DAYS_ENDDATE_FACT features from the bureau dataset. The correlation analysis indicates that borrowers who have longer periods since the last credit inquiry (DAYS_CREDIT), or longer periods since the last update of credit-related information (DAYS_CREDIT_UPDATE), or longer periods since the end of the credit line (DAYS_ENDDATE_FACT) are slightly more likely to have difficulty repaying the loan.
Similarly, we selected MONTHS_BALANCE, AMT_BALANCE, and CNT_DRAWINGS_ATM_CURRENT features from credit_card_balance. The correlation analysis suggests that borrowers who have a higher balance (AMT_BALANCE) or more ATM withdrawals (CNT_DRAWINGS_ATM_CURRENT) are slightly more likely to have difficulty in repaying the loan. The MONTHS_BALANCE feature indicates the number of months between the current month and the month when the credit card balance was calculated, which could also impact the loan repayment ability of the borrower.
We selected AMT_CREDIT and CNT_PAYMENT features from previous_application. The AMT_CREDIT feature provides information on the total amount of credit requested by the borrower for the previous application, which can help in assessing the borrower's creditworthiness and repayment capacity. The CNT_PAYMENT feature indicates the number of payments that the borrower has committed to make, which can help in predicting the borrower's payment behavior and loan repayment risk. Analyzing the correlation between these features and the TARGET column can provide insights into the relationship between the borrower's creditworthiness, payment behavior, and loan repayment difficulty. This can aid in making informed decisions regarding loan approvals, risk assessments, and borrower management.
For analysis of the POS_CASH_balance dataset, we selected the features MONTHS_BALANCE, CNT_INSTALMENT_FUTURE, and CNT_INSTALMENT. The correlation analysis suggests that borrowers with a higher number of future installment payments (CNT_INSTALMENT_FUTURE) are slightly more likely to have difficulty repaying the loan. The MONTHS_BALANCE feature provides insight into the length of time the borrower has been making installment payments, which can be used to understand payment behavior and repayment risk. The CNT_INSTALMENT feature gives information about the number of installments in the payment plan, which can be used to identify payment patterns and potential risk factors.
We selected DAYS_ENTRY_PAYMENT, DAYS_INSTALMENT, and AMT_PAYMENT features from installments_payments. The correlation analysis shows that borrowers who pay later (DAYS_ENTRY_PAYMENT) or have longer installment periods (DAYS_INSTALMENT) are slightly more likely to have difficulty repaying the loan. AMT_PAYMENT helps to understand the relationship between borrower payment behavior and loan repayment difficulty. Additionally, it can be used to calculate other crucial features, such as late payment or underpayment amounts, providing insights into borrower payment behavior and loan repayment risk.
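The selection procedure described above can be expressed generically. The helper below (a hypothetical sketch, not the team's exact code) left-merges a secondary table into the labeled primary table and ranks its numeric columns by absolute correlation with TARGET:

```python
import pandas as pd

def rank_by_target_corr(primary, secondary, key="SK_ID_CURR"):
    """Rank a secondary table's numeric columns by |corr| with TARGET."""
    merged = primary[[key, "TARGET"]].merge(secondary, on=key, how="left")
    corr = merged.corr(numeric_only=True)["TARGET"].abs()
    return corr.drop(labels=["TARGET", key]).sort_values(ascending=False)

# Toy illustration (hypothetical values)
primary = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4], "TARGET": [0, 0, 1, 1]})
secondary = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_PAYMENT": [10.0, 11.0, 30.0, 29.0],  # tracks TARGET closely
    "NOISE": [5.0, 7.0, 6.0, 5.5],            # unrelated to TARGET
})
ranking = rank_by_target_corr(primary, secondary)
print(ranking)
```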
# LOAD THE DATASETS
X_train = datasets["application_train"]        # primary dataset
appsDF = datasets["previous_application"]      # previous applications
burDF = datasets['bureau']
bureau_balDF = datasets['bureau_balance']
ccbDF = datasets['credit_card_balance']
ipDF = datasets['installments_payments']
posDF = datasets['POS_CASH_balance']
# SELECTED FEATURES ARE TAKEN FROM SECONDARY TABLES BASED ON CORRELATION ANALYSIS
pr_features = ['AMT_CREDIT','CNT_PAYMENT']
br_features = ['DAYS_CREDIT', 'DAYS_CREDIT_UPDATE','DAYS_ENDDATE_FACT']
cc_features = ['MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_DRAWINGS_ATM_CURRENT']
ip_features = ['DAYS_ENTRY_PAYMENT','DAYS_INSTALMENT', 'AMT_PAYMENT']
pos_features = ['MONTHS_BALANCE','CNT_INSTALMENT_FUTURE', 'CNT_INSTALMENT']
#PREVIOUS APPLICATION TRANSFORMATION
prevApps_feature_pipeline = Pipeline([
('prevApps_aggregater', FeaturesAggregater(pr_features)), # Aggregate across old and new features
])
prevApps_aggregated = prevApps_feature_pipeline.fit_transform(appsDF)
#Bureau Transformation
bureau_feature_pipeline = Pipeline([
('bureau_aggregater', FeaturesAggregater(br_features)), # Aggregate across old and new features
])
bureau_aggregated = bureau_feature_pipeline.fit_transform(burDF)
#Credit Card Balance Transformation
ccbal_feature_pipeline = Pipeline([
('ccbal_aggregater', FeaturesAggregater(cc_features)), # Aggregate across old and new features
])
ccbal_aggregated = ccbal_feature_pipeline.fit_transform(ccbDF)
#Installments Payments Transformation
inspay_feature_pipeline = Pipeline([
('inspay_aggregater', FeaturesAggregater(ip_features)), # Aggregate across old and new features
])
inspay_aggregated = inspay_feature_pipeline.fit_transform(ipDF)
#POS CASH Balance Transformation
poscashbal_feature_pipeline = Pipeline([
('poscashbal_aggregater', FeaturesAggregater(pos_features)), # Aggregate across old and new features
])
poscashbal_aggregated = poscashbal_feature_pipeline.fit_transform(posDF)
# merge primary table and secondary tables using features based on metadata and aggregate stats
merged_data=True
if merged_data:
# 1. Join/Merge in prevApps Data
X_train = X_train.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# 2. Join/Merge in Bureau Data
X_train = X_train.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
# 3. Join/Merge in credit_card_balance Data
X_train = X_train.merge(ccbal_aggregated, how='left', on="SK_ID_CURR")
# 4. Join/Merge in Aggregated installments_payments Data
X_train = X_train.merge(inspay_aggregated, how='left', on="SK_ID_CURR")
# 5. Join/Merge in Aggregated POS_CASH_balance Data
X_train = X_train.merge(poscashbal_aggregated, how='left', on="SK_ID_CURR")
if merged_data:
X_kaggle_test= datasets["application_test"]
X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(ccbal_aggregated, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(inspay_aggregated, how='left', on='SK_ID_CURR')
X_kaggle_test = X_kaggle_test.merge(poscashbal_aggregated, how='left', on='SK_ID_CURR')
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
'1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
return X
X_train['INCOME_CREDIT_PERCENT'] = (X_train.AMT_INCOME_TOTAL / X_train.AMT_CREDIT).replace(np.inf, 0)
X_train['REPAYMENT_PERCENT'] = (X_train.CNT_INSTALMENT_FUTURE_max / X_train.CNT_INSTALMENT_min).replace(np.inf, 0)
X_train['ANN_INC_PERCENT'] = (X_train.AMT_ANNUITY / X_train.AMT_INCOME_TOTAL).replace(np.inf, 0)
X_kaggle_test['INCOME_CREDIT_PERCENT'] = (X_kaggle_test.AMT_INCOME_TOTAL / X_kaggle_test.AMT_CREDIT).replace(np.inf, 0)
X_kaggle_test['REPAYMENT_PERCENT'] = (X_kaggle_test.CNT_INSTALMENT_FUTURE_max / X_kaggle_test.CNT_INSTALMENT_min).replace(np.inf, 0)
X_kaggle_test['ANN_INC_PERCENT'] = (X_kaggle_test.AMT_ANNUITY / X_kaggle_test.AMT_INCOME_TOTAL).replace(np.inf, 0)
This new column, named 'INCOME_CREDIT_PERCENT', will give us a measure of how much of the applicant's total income is being used to pay off the loan. A high percentage may indicate that the applicant is taking on too much debt relative to their income, which may increase the risk of default. Similarly, a low percentage may indicate that the applicant has a strong financial position and can comfortably repay the loan. Lenders often use this ratio to assess the borrower's creditworthiness and to set borrowing limits.
This new column, named 'REPAYMENT_PERCENT', can be used as a measure of the applicant's repayment history. If the ratio is high, it indicates that the applicant is able to pay off their loans quickly and is therefore less likely to default. However, if the ratio is low, it may indicate that the applicant struggles to pay off their loans on time, potentially indicating a higher risk of default.
This new column, named 'ANN_INC_PERCENT', is an important metric for assessing the applicant's financial stability and their ability to make timely payments, as it represents the percentage of their income that is being used to make annuity payments. A higher value for this ratio indicates a greater proportion of income being used towards the annuity payments, which may be a potential red flag for lenders as it increases the risk of default. On the other hand, a lower value for this ratio suggests that the applicant has more disposable income available for other expenses, and hence, may be less likely to default on their loan. Therefore, this new column 'ANN_INC_PERCENT' serves as an important feature for lenders to evaluate the creditworthiness of an applicant.
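The ratio snippets earlier replace only positive infinity; a slightly more defensive variant (a sketch with hypothetical data, assuming negative values can occur) also maps -inf from negative numerators over zero denominators to 0:

```python
import numpy as np
import pandas as pd

def safe_ratio(numer, denom):
    """Elementwise ratio with inf/-inf (from zero denominators) mapped to 0."""
    return (numer / denom).replace([np.inf, -np.inf], 0)

income = pd.Series([100.0, 200.0, -50.0])
credit = pd.Series([50.0, 0.0, 0.0])
print(safe_ratio(income, credit).tolist())  # [2.0, 0.0, 0.0]
```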
There are several techniques for handling class imbalance. They involve either oversampling the minority class or undersampling the majority class so that the two classes contribute comparable numbers of samples. Oversampling can be done by replicating minority-class samples, while undersampling randomly removes samples from the majority class. Python libraries such as imblearn offer resampling techniques including RandomOverSampler, RandomUnderSampler, and SMOTE. We tried both directions, but neither proved feasible here: undersampling discarded so much information that training accuracy suffered, and oversampling produced so many rows that the server crashed. Instead, we decided to remove majority-class rows with more than 40% missing values.
mask = (X_train['TARGET'] == 0) & (X_train.isna().mean(axis=1) >= 0.4)
# drop rows that satisfy the condition
X_train = X_train.drop(X_train[mask].index)
X_train.shape
(287427, 178)
The row count dropped from 307,511 to 287,427.
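The undersampling alternative mentioned above (which we rejected for this project) can be sketched without imblearn, using scikit-learn's `resample` on a toy imbalanced frame (hypothetical data):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 8 majority-class rows, 2 minority-class rows
df = pd.DataFrame({"x": range(10), "TARGET": [0]*8 + [1]*2})

majority = df[df["TARGET"] == 0]
minority = df[df["TARGET"] == 1]

# Randomly undersample the majority class down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
print(balanced["TARGET"].value_counts().to_dict())  # {0: 2, 1: 2}
```

As noted above, this balance comes at the cost of throwing away most of the majority-class information, which is why we chose targeted row removal instead.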
from copy import deepcopy
percent = (X_train.isnull().sum()/X_train.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = X_train.isna().sum().sort_values(ascending = False)
total_count_with = X_train.isnull().count()
total_count_without = X_train.count()
missing_application_train_data = pd.concat([percent, sum_missing, total_count_without, total_count_with], axis=1, keys=['Percent', 'Missing Count', 'Total Count without Missing value', 'Total Count'])
missing_application_train_data
| Percent | Missing Count | Total Count without Missing value | Total Count | |
|---|---|---|---|---|
| CNT_DRAWINGS_ATM_CURRENT_max | 78.73 | 225559 | 60924 | 286483 |
| CNT_DRAWINGS_ATM_CURRENT_mean | 78.73 | 225559 | 60924 | 286483 |
| CNT_DRAWINGS_ATM_CURRENT_median | 78.73 | 225559 | 60924 | 286483 |
| CNT_DRAWINGS_ATM_CURRENT_min | 78.73 | 225559 | 60924 | 286483 |
| MONTHS_BALANCE_median_x | 69.86 | 200129 | 86354 | 286483 |
| ... | ... | ... | ... | ... |
| REG_CITY_NOT_WORK_CITY | 0.00 | 0 | 286483 | 286483 |
| REG_CITY_NOT_LIVE_CITY | 0.00 | 0 | 286483 | 286483 |
| LIVE_REGION_NOT_WORK_REGION | 0.00 | 0 | 286483 | 286483 |
| REG_REGION_NOT_WORK_REGION | 0.00 | 0 | 286483 | 286483 |
| SK_ID_CURR | 0.00 | 0 | 286483 | 286483 |
181 rows × 4 columns
X_train_missing_data = pd.DataFrame((X_train.isnull().sum())/X_train.shape[0])
X_train_missing_col = X_train_missing_data[X_train_missing_data[0] > 0.6].index
X_train = X_train[[col for col in X_train.columns if col not in X_train_missing_col]]
X_train.shape
(286483, 152)
#ref: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on
X_train_corr_matrix = X_train.corr().abs()
# Select upper triangle of correlation matrix
upper = X_train_corr_matrix.where(np.triu(np.ones(X_train_corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
# Drop features
X_train.drop(to_drop, axis=1, inplace=True)
X_train.shape
(286483, 107)
X_train2=deepcopy(X_train)
X_train=deepcopy(X_train2)
Please see this blog for more details on OHE when the validation/test sets have previously unseen unique values.
# Split the provided training data into training, validation, and test sets
# The kaggle evaluation test set has no labels
#
from sklearn.model_selection import train_test_split
use_application_data_ONLY = False #use joined data
if use_application_data_ONLY:
# just selected a few features for a baseline experiment
selected_features1 = [col for col in datasets["application_train"].columns if col != 'TARGET']
selected_features2= selected_features1+br_features+pr_features+cc_features+ip_features+pos_features+['NAME_CONTRACT_TYPE']
selected_features3 = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
X_train = datasets["application_train"][selected_features2]
y_train = datasets["application_train"]['TARGET']
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_kaggle_test= datasets["application_test"][selected_features2]
# y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
else:
y_train = X_train['TARGET']
selected_features1 = [col for col in X_train.columns if col not in ['TARGET','SK_ID_CURR']]
selected_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
X_train = X_train[selected_features1]
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train,test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train, test_size=0.15, random_state=42)
X_kaggle_test= X_kaggle_test[selected_features1]
# y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
X train shape: (206983, 105) X validation shape: (42973, 105) X test shape: (36527, 105) X X_kaggle_test shape: (48744, 105)
X_train1=deepcopy(X_train)
from datetime import datetime
from time import time

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

import xgboost
from xgboost import XGBClassifier, plot_importance
col=X_train1.columns
col
num_attribs1 = X_train1.select_dtypes(include=['int64', 'float64']).columns
imputer = SimpleImputer(strategy='mean')
X_train1[num_attribs1] = imputer.fit_transform(X_train1[num_attribs1])
cat_attribs1 = X_train1.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for col in cat_attribs1:
imputer = SimpleImputer(strategy='constant', fill_value='missing')
X_train1[col] = imputer.fit_transform(X_train1[[col]])
X_train1[col] = label_encoder.fit_transform(X_train1[col])
# Apply standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train1)
xgb_model = XGBClassifier()
xgb_model.fit(X_train_scaled, y_train)
# Calculate feature importances
importances = xgb_model.feature_importances_
col=X_train1.columns
# create a DataFrame of feature importances
importance_df = pd.DataFrame({
"Feature": col,
"Importance": importances
})
# sort the DataFrame by feature importance in descending order
importance_df = importance_df.sort_values("Importance", ascending=False)
importance_df
| | Feature | Importance |
|---|---|---|
| 80 | CNT_PAYMENT_min | 0.071660 |
| 46 | EMERGENCYSTATE_MODE | 0.054275 |
| 70 | AMT_REQ_CREDIT_BUREAU_HOUR | 0.049927 |
| 71 | AMT_REQ_CREDIT_BUREAU_DAY | 0.048500 |
| 35 | EXT_SOURCE_3 | 0.043497 |
| ... | ... | ... |
| 60 | FLAG_DOCUMENT_12 | 0.000000 |
| 58 | FLAG_DOCUMENT_10 | 0.000000 |
| 57 | FLAG_DOCUMENT_9 | 0.000000 |
| 55 | FLAG_DOCUMENT_7 | 0.000000 |
| 52 | FLAG_DOCUMENT_4 | 0.000000 |
105 rows × 2 columns
# Sort the feature importances in ascending order
sorted_idx = importances.argsort()
# Create the bar plot
fig, ax = plt.subplots(figsize=(15, 20))
ax.barh(col[sorted_idx], importances[sorted_idx])
# Add annotations to the plot
for i, v in enumerate(importances[sorted_idx]):
ax.text(v, i, "{:.3f}".format(v), color='blue', fontweight='bold', fontsize=10)
ax.set_xlabel("XGBoost Feature Importance")
plt.show()
threshold = 0.005
importance_df_filtered = importance_df[importance_df["Importance"] > threshold]
final_selected_feature=importance_df_filtered['Feature'].values
X_train_final=X_train[final_selected_feature]
X_test_final=X_test[final_selected_feature]
X_valid_final=X_valid[final_selected_feature]
X_kaggle_test1=deepcopy(X_kaggle_test)
X_kaggle_test=X_kaggle_test[final_selected_feature]
X_train_final.to_csv("X_train.csv")
X_test_final.to_csv("X_test.csv")
X_valid_final.to_csv("X_valid.csv")
X_kaggle_test.to_csv("X_kaggle_test.csv")
y_train.to_csv("y_train.csv")
y_test.to_csv("y_test.csv")
y_valid.to_csv("y_valid.csv")
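The manual threshold filter above can also be expressed with scikit-learn's `SelectFromModel`, which was imported earlier but never used. A minimal sketch on synthetic data (the estimator, threshold, and shapes here are illustrative, not the project's actual setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training matrix
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=42)

# Fit any estimator that exposes feature_importances_, then select by threshold
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
selector = SelectFromModel(clf, threshold=0.005, prefit=True)
X_reduced = selector.transform(X)

print(X_reduced.shape[1] <= X.shape[1])  # True: only features above the threshold survive
```

This keeps the selection logic inside a scikit-learn object, so it can later be embedded in a pipeline rather than done by hand on DataFrames.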
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
numerical_features = X_train_final.select_dtypes(include=[np.number])
categorical_features = X_train_final.select_dtypes(exclude=[np.number])
numerical_features = list(numerical_features.columns)
categorical_features = list(categorical_features.columns)
print("=" * 30)
print("| Numerical Features |")
print("=" * 30)
for features in numerical_features:
print("| " + features)
print("=" * 30)
print("\n\n")
print("=" * 30)
print("| Categorical Features |")
print("=" * 30)
for features in categorical_features:
print("| " + features)
print("=" * 30)
# Identify the numeric features we wish to consider.
num_attribs1= X_train_final.select_dtypes(include=['int64', 'float64']).columns
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs1)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs1 = X_train_final.select_dtypes(include=['object']).columns
# Note handle_unknown="ignore" in the OHE, which ignores category values in the
# validation/test sets that do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs1)),
#('imputer', SimpleImputer(strategy='most_frequent')),
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
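On recent scikit-learn versions, the DataFrameSelector-plus-FeatureUnion pattern above can be replaced by `ColumnTransformer`, which accepts DataFrames directly. A minimal sketch on toy data (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

toy = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2000.0, np.nan, 4000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", np.nan],
})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("std_scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([
    ("num", num_pipe, ["AMT_CREDIT"]),
    ("cat", cat_pipe, ["NAME_CONTRACT_TYPE"]),
])
X = prep.fit_transform(toy)
print(X.shape)  # (4, 4): one scaled numeric column plus three one-hot columns
```

The behavior matches the FeatureUnion above (mean imputation and scaling for numerics, constant imputation and one-hot encoding for categoricals) without a custom selector class.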
To get a baseline, we use a subset of the features after preprocessing them through the pipeline. The baseline model is a logistic regression.
from joblib import load
from joblib import dump
def train_model(model_name,model,description,hyper,saved):
input_f=str(X_train_final.shape[1])
model_name=model_name+"-"+input_f+" input features"
if saved:
# load the saved model; no retraining is performed
pipeline = load(f'{model_name}.joblib')
Training_time = 0  # nothing was trained in this run, so there is no training time to report
else:
print(model_name)
pipeline = Pipeline([
("preparation", data_prep_pipeline),
('model',model)])
t0 = time()
pipeline.fit(X_train_final, y_train)
dump(pipeline, f'{model_name}.joblib')
Training_time = time()-t0
if hyper:
print("Best parameters")
best_p=pipeline.named_steps["model"].best_params_
print(best_p)
else:
best_p=pipeline.named_steps["model"].get_params()
print(best_p)
y_pred_train=pipeline.predict(X_train_final)
train_accuracy = accuracy_score(y_train, y_pred_train)
train_r_score=roc_auc_score(y_train, pipeline.predict_proba(X_train_final)[:, 1])
#for validation set
y_pred_valid=pipeline.predict(X_valid_final)
valid_accuracy = accuracy_score(y_valid, y_pred_valid)
valid_r_score=roc_auc_score(y_valid, pipeline.predict_proba(X_valid_final)[:, 1])
# Make predictions on test
t1= time()
y_pred_test=pipeline.predict(X_test_final)
y_pred_model=pipeline.predict_proba(X_test_final)[:, 1]
testing_time=time() - t1
test_accuracy = accuracy_score(y_test, y_pred_test)
test_r_score=roc_auc_score(y_test, pipeline.predict_proba(X_test_final)[:, 1])
print(classification_report(y_test, y_pred_test))
# Create confusion matrix for y_pred_train
cm_train = confusion_matrix(y_train, y_pred_train)
# Create confusion matrix for y_pred_test
cm_test = confusion_matrix(y_test, y_pred_test)
# Visualize confusion matrix for y_pred_train
sns.heatmap(cm_train, annot=True, cmap='Blues')
plt.title('Confusion Matrix (Train)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
# Visualize confusion matrix for y_pred_test
sns.heatmap(cm_test, annot=True, cmap='Blues')
plt.title('Confusion Matrix (Test)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
expLog.loc[len(expLog)] =[f"{model_name}",f"{train_accuracy*100:8.2f}%",f"{valid_accuracy*100:8.2f}%",f"{test_accuracy*100:8.2f}%",\
f"{train_r_score*100:8.2f}%",f"{valid_r_score*100:8.2f}%",f"{test_r_score*100:8.2f}%",\
f"{Training_time:8.2f} secs", f"{testing_time:8.2f} secs",
f"{best_p}",description]
return y_pred_model
expLog=pd.DataFrame(columns=["exp_name",
"Train Acc",
"Valid Acc",
"Test Acc",
"Train AUC",
"Valid AUC",
"Test AUC",
"Train time",
"Test time",
"Best Params",
"Description"
])
try:
expLog
except NameError:
expLog = pd.DataFrame(columns=["exp_name",
"Train Acc",
"Valid Acc",
"Test Acc",
"Train AUC",
"Valid AUC",
"Test AUC",
"Train time",
"Test time",
"Best Params",
"Description"
])
expLog
%%time
np.random.seed(42)
lr_baseline_pipeline = Pipeline([
("preparation", data_prep_pipeline),
("model", LogisticRegression())
])
lr_base_model = lr_baseline_pipeline.fit(X_train_final, y_train)
y_pred_lr=train_model("Baseline Logistic Regression",LogisticRegression(),"Joined Feature Aggregated in input features",False,False)
expLog
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The SkLearn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in one number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
y_pred_dt=train_model("Baseline Decision Tree",DecisionTreeClassifier(random_state=42),"Joined Feature Aggregated in input features",False,False)
expLog
y_pred_rf=train_model("Baseline Random Forest",RandomForestClassifier(),"Joined Feature Aggregated in input features",False,False)
expLog
y_pred_gb=train_model("Baseline GaussianNB",GaussianNB(),"Joined Feature Aggregated in input features",False,False)
expLog
from sklearn import metrics
plt.figure(figsize=(8, 8))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_lr)
auc = round(metrics.roc_auc_score(y_test, y_pred_lr), 4)
plt.plot(fpr,tpr,label="Logistic Regression, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_dt)
auc = round(metrics.roc_auc_score(y_test, y_pred_dt), 4)
plt.plot(fpr,tpr,label="Decision Tree Classifier, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_rf)
auc = round(metrics.roc_auc_score(y_test, y_pred_rf), 4)
plt.plot(fpr,tpr,label="Random Forest Classifier, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_gb)
auc = round(metrics.roc_auc_score(y_test, y_pred_gb), 4)
plt.plot(fpr,tpr,label="Gaussian Naive Bayes Classifier, AUC="+str(auc))
plt.legend()
y_pred_XGB=train_model("Baseline XGBboost",XGBClassifier(),"Joined Feature Aggregated in input features",False,False)
expLog
We also tried Gradient Boosting, but it took longer to train than XGBoost while achieving lower accuracy, so we decided to proceed with a single ensemble technique.
#train_model("Baseline GradientBoosting",GradientBoostingClassifier(),"Joined Feature Aggregated in input features",False)
#expLog
Hyperparameters to be tuned:
- tune_model - the machine learning model to be tuned - logistic regression
- params:
C - the inverse regularization strength, searched over [0.01, 10.0, 100.0]
penalty - the regularization penalty, searched over both l1 and l2
- cv - number of cross-validation folds is 3
- scoring - the evaluation metric is accuracy
- n_jobs - number of parallel jobs; -1 uses all available processors
- verbose - a verbosity of 1 prints information about the progress of the search
Summary - Grid search tries every combination of the C and penalty values in the params dictionary, evaluating a logistic regression model by cross-validation for each combination. The optimal C and penalty are those that maximize the score specified by the scoring parameter, here 'accuracy'.
tune_model=LogisticRegression(solver='liblinear')  # liblinear supports both the l1 and l2 penalties in the grid
params = {'C':[0.01,10.0, 100.0],
'penalty':['l1', 'l2']}
clf_gridsearch = GridSearchCV(tune_model,params,cv=3,scoring='accuracy',n_jobs=-1, verbose=1)
y_pred_lr1=train_model("Tuned Logistic Regression",clf_gridsearch,"Joined Feature Aggregated+Hyperparameters",True,False)
expLog
Hyperparameters to be tuned:
- gnb - The machine learning model to be tuned - Gaussian Naive Bayes
- params:
var_smoothing -amount of smoothing to be applied is set to [1e-9, 1e-8, 1e-7, 1e-6,1e-10]
- cv - Number of cv folds is 3
- n_jobs - number of parallel jobs; -1 uses all available processors
- verbose - a verbosity of 1 prints information about the progress of the search
Summary - The process involves performing a grid search to find the optimal value of the hyperparameter var_smoothing for the Gaussian Naive Bayes classifier.
There is no need to perform hyperparameter tuning for Gaussian Naive Bayes: it is not well suited to this problem, since it assumes continuous, normally distributed features, and its training accuracy is already clearly low.
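For reference, the var_smoothing search described above would be set up as follows (shown on synthetic data; we did not run it on the project data for the reasons just given):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

params = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-10]}
gnb_gridsearch = GridSearchCV(GaussianNB(), params, cv=3, n_jobs=-1, verbose=1)
gnb_gridsearch.fit(X, y)
print(gnb_gridsearch.best_params_)  # the var_smoothing value with the best CV score
```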
Hyperparameters to be tuned:
- dt - The machine learning model to be tuned - Decision Tree classifier
- params:
- max_depth -the max depth of the decision tree is set to [2, 4]
- min_samples_split - the minimum number of samples required to split an internal node is set to [2, 4]
- min_samples_leaf - the minimum number of samples required to be at a leaf node is set to [2,4]
- cv - Number of cv folds is 3
- n_jobs - number of parallel jobs; -1 uses all available processors
- verbose - a verbosity of 1 prints information about the progress of the search
# Create a decision tree classifier
dt = DecisionTreeClassifier()
# Define the hyperparameter grid to search over
params = {
'max_depth': [2, 4],
'min_samples_split': [2, 4],
'min_samples_leaf': [2, 4]}
clf_gridsearch = GridSearchCV(dt, params, cv=3,n_jobs=-1, verbose=1)
y_pred_dt1=train_model("Tuned Decision Tree",clf_gridsearch,"Joined Feature Aggregated+Hyperparameters",True,False)
expLog
Hyperparameters to be tuned:
- rf - the machine learning model to be tuned - random forest classifier
- params:
- n_estimators - the number of trees in the forest, searched over [10, 50]
- min_samples_leaf - the minimum number of samples required at a leaf node, searched over [2, 5, 10]
- max_depth - the maximum depth of each tree, searched over [5, 10]
- cv - number of cross-validation folds is 3
- n_jobs - number of parallel jobs; -1 uses all available processors
- verbose - a verbosity of 1 prints information about the progress of the search
Summary - For the random forest classifier, we set up a grid of values for n_estimators, min_samples_leaf, and max_depth, then use grid search to find the best combination, evaluating each candidate with cross-validation. The search runs on all available processor cores and prints progress information.
rf = RandomForestClassifier()
# Define the hyperparameter grid to search over
params = {
'n_estimators': [10,50],
'min_samples_leaf': [2, 5, 10],
'max_depth': [5, 10]
}
clf_gridsearch = GridSearchCV(rf, params, cv=3,n_jobs=-1, verbose=1)
y_pred_rf1=train_model("Tuned Random Forest Trees",clf_gridsearch,"Joined Feature Aggregated+Hyperparameters",True,False)
expLog
from sklearn import metrics
plt.figure(figsize=(10, 5))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_lr1)
auc = round(metrics.roc_auc_score(y_test, y_pred_lr1), 4)
plt.plot(fpr,tpr,label="Tuned Logistic Regression, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_dt1)
auc = round(metrics.roc_auc_score(y_test, y_pred_dt1), 4)
plt.plot(fpr,tpr,label="Tuned Decision Tree Classifier, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_rf1)
auc = round(metrics.roc_auc_score(y_test, y_pred_rf1), 4)
plt.plot(fpr,tpr,label="Tuned Random Forest Classifier, AUC="+str(auc))
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_XGB)
auc = round(metrics.roc_auc_score(y_test, y_pred_XGB), 4)
plt.plot(fpr,tpr,label="XGBoost Classifier, AUC="+str(auc))
plt.legend()
from sklearn.neural_network import MLPClassifier
y_pred_mlp=train_model("Baseline MLP",MLPClassifier(),"Joined Feature Aggregated",False,False)
expLog
mlp = MLPClassifier()
param_grid = {
'hidden_layer_sizes': [(120, 80, 40), (150, 100, 50), (128, 64, 32)],
'max_iter': [5, 10, 50, 100, 150, 200],
'activation': ['tanh', 'relu', 'logistic'],  # MLPClassifier names the sigmoid activation 'logistic'
'solver': ['sgd', 'adam'],
'alpha': [0.1, 0.01, 0.001, 0.0001, 0.5, 0.05],
'learning_rate': ['constant','adaptive'],
}
# Search over param_grid (not the leftover random forest params) and pass the hyper/saved flags
clf_gridsearch = GridSearchCV(mlp, param_grid, cv=3, n_jobs=-1, verbose=1)
train_model("Tuned MLP Classifier",clf_gridsearch,"Joined Feature Aggregated+Hyperparameter Tuning",True,False)
expLog
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import time
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
from torch.utils.data.sampler import WeightedRandomSampler
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/exp4")
!pip install tensorboard pyngrok
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: tensorboard in /usr/local/lib/python3.9/dist-packages (2.12.2)
Collecting pyngrok
Downloading pyngrok-6.0.0.tar.gz (681 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 681.2/681.2 kB 22.5 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (0.40.0)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.2.3)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (0.7.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (3.4.3)
Requirement already satisfied: google-auth-oauthlib<1.1,>=0.5 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.0.0)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.27.1)
Requirement already satisfied: protobuf>=3.19.6 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (3.20.3)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.8.1)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (67.7.2)
Requirement already satisfied: grpcio>=1.48.2 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.53.0)
Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.22.4)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (1.4.0)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.9/dist-packages (from tensorboard) (2.17.3)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.9/dist-packages (from pyngrok) (6.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (0.2.8)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (1.16.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (5.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.9/dist-packages (from google-auth-oauthlib<1.1,>=0.5->tensorboard) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.9/dist-packages (from markdown>=2.6.8->tensorboard) (6.4.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard) (2022.12.7)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.9/dist-packages (from werkzeug>=1.0.1->tensorboard) (2.1.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard) (3.15.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.9/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.9/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<1.1,>=0.5->tensorboard) (3.2.2)
Building wheels for collected packages: pyngrok
Building wheel for pyngrok (setup.py) ... done
Created wheel for pyngrok: filename=pyngrok-6.0.0-py3-none-any.whl size=19879 sha256=8a66edf919f4e061923607a94d1dcb32baa7728b03b6e5cf948129d2773bba4a
Stored in directory: /root/.cache/pip/wheels/31/49/9c/44b13823eb256a3b4dff34b972f7a3c7d9910bfef269e59bd7
Successfully built pyngrok
Installing collected packages: pyngrok
Successfully installed pyngrok-6.0.0
%load_ext tensorboard
#Read the saved filtered datasets
X_train= pd.read_csv('X_train.csv')
X_valid = pd.read_csv('X_valid.csv')
X_test = pd.read_csv('X_test.csv')
X_kaggle_test = pd.read_csv('X_kaggle_test.csv')
y_train=pd.read_csv('y_train.csv')
y_valid=pd.read_csv('y_valid.csv')
y_test=pd.read_csv('y_test.csv')
#Remove Index Column
X_train= X_train.iloc[:,1:]
X_valid =X_valid.iloc[:,1:]
X_test = X_test.iloc[:,1:]
X_kaggle_test = X_kaggle_test.iloc[:,1:]
y_train=y_train['TARGET']
y_valid=y_valid['TARGET']
y_test=y_test['TARGET']
# While scoring X_kaggle_test we hit shape mismatches after one-hot encoding: a few categorical
# values appear in the training data but not in the Kaggle test set. We drop the handful of
# rows (3-4 in total) that contain those values.
# Define the conditions for filtering out rows
cond1 = (X_train['CODE_GENDER'] != 'XNA') & (X_train['NAME_INCOME_TYPE'] != 'Maternity leave' ) & (X_train['NAME_FAMILY_STATUS'] != 'Unknown' )
cond2 = (X_valid['CODE_GENDER'] != 'XNA') & (X_valid['NAME_INCOME_TYPE'] != 'Maternity leave') & (X_valid['NAME_FAMILY_STATUS'] != 'Unknown' )
cond3 = (X_test['CODE_GENDER'] != 'XNA') & (X_test['NAME_INCOME_TYPE'] != 'Maternity leave') & (X_test['NAME_FAMILY_STATUS'] != 'Unknown' )
# Filter out rows that meet the conditions
X_train = X_train.loc[cond1]
X_valid = X_valid.loc[cond2]
X_test = X_test.loc[cond3]
# Remove the corresponding rows from the target variable
y_train = y_train.loc[cond1]
y_valid = y_valid.loc[cond2]
y_test = y_test.loc[cond3]
#CONVERT THEM TO NUMPY ARRAY FOR TRAINING
y_train = y_train.to_numpy()
y_validation = y_valid.to_numpy()
y_test = y_test.to_numpy()
#Tranform the features again for Pytorch model
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
# Identify the numeric features we wish to consider.
num_attribs1= X_train.select_dtypes(include=['int64', 'float64']).columns
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs1)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs1 = X_train.select_dtypes(include=['object']).columns
# Note handle_unknown="ignore" in the OHE, which ignores category values in the
# validation/test sets that do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs1)),
#('imputer', SimpleImputer(strategy='most_frequent')),
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
X_train = data_prep_pipeline.fit_transform(X_train)
X_validation = data_prep_pipeline.transform(X_valid)
X_test = data_prep_pipeline.transform(X_test)
# # define undersample strategy
# random_undersampler = RandomUnderSampler(sampling_strategy='majority')
# # fit and apply the transform
# X_train, y_train = random_undersampler.fit_resample(X_train, y_train)
# # Apply SMOTE
# sm = SMOTE(random_state=42)
# X_train, y_train = sm.fit_resample(X_train,y_train)
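As an alternative to the commented-out SMOTE/undersampling above, the class imbalance can also be handled at the DataLoader level with WeightedRandomSampler (imported earlier but otherwise unused). A sketch on synthetic tensors; the sizes and the inverse-frequency weighting here are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)
# Imbalanced toy data: 90 negatives, 10 positives
X = torch.randn(100, 4)
y = torch.cat([torch.zeros(90), torch.ones(10)])

# Weight every sample by the inverse frequency of its class
class_counts = torch.bincount(y.long()).float()
sample_weights = 1.0 / class_counts[y.long()]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)

# Batches drawn through the sampler are roughly class-balanced in expectation
loader = DataLoader(TensorDataset(X, y), batch_size=20, sampler=sampler)
batch_X, batch_y = next(iter(loader))
print(batch_X.shape)
```

Unlike SMOTE, this changes only the sampling distribution, not the data itself, so no synthetic rows are created.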
A Multi-Layer Perceptron (MLP) is used here for binary classification. It is composed of an input layer, one or more hidden layers, and an output layer. The input layer receives the data; each hidden layer is a set of neurons, each of which takes inputs from the previous layer and produces an output through an activation function, so the hidden layers apply a sequence of nonlinear transformations that extract useful features. The output layer produces the model's final prediction.
Each linear layer takes the number of neurons in the previous layer as its input dimension and the number of neurons in the current layer as its output dimension. It applies an affine transformation: the input tensor is multiplied by a weight matrix and a bias vector is added to produce the output tensor.

Batch normalization (BatchNorm1d) is a technique for normalizing the inputs of each layer, usually applied between the linear and non-linear transformations. It rescales the activations and stabilizes learning by reducing internal covariate shift. Applying BatchNorm1d to the output of a linear layer standardizes the inputs to the following activation, which can improve training speed and stability.
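Our classifier below does not actually include batch normalization; if we wanted it, a BatchNorm1d layer could be inserted between each linear layer and its activation. A hypothetical fragment (the layer sizes are illustrative, not the model's):

```python
import torch
import torch.nn as nn

# One hidden block with batch normalization between the linear layer and the activation
block = nn.Sequential(
    nn.Linear(8, 4),
    nn.BatchNorm1d(4),  # normalizes each of the 4 activations across the batch
    nn.ReLU(),
)
out = block(torch.randn(16, 8))  # batch of 16 samples with 8 features
print(out.shape)
```

Note that BatchNorm1d needs batches of more than one sample while in training mode.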
Dropout is a regularization technique used to prevent overfitting in neural networks. During training, some randomly chosen neurons are turned off or “dropped out” with a specified probability. This prevents the network from relying too heavily on a few neurons and promotes the learning of more robust features. We set the dropout rate to 0.2, meaning that 20% of the neurons are dropped at each iteration.
Sigmoid activation function is used in neural networks to introduce non-linearity in the output of a neuron. It maps any input value to a value between 0 and 1, which can be interpreted as a probability. It is commonly used in binary classification problems, where the output is interpreted as the probability of belonging to the positive class.
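As a quick numeric check of this mapping (plain NumPy, separate from the model code):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)) squashes any real logit into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))            # 0.5: a zero logit means maximal uncertainty
print(round(sigmoid(2.0), 4))  # 0.8808: a positive logit leans toward the positive class
```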

# Define the model architecture
class BinaryClassifier(nn.Module):
def __init__(self, input_size):
super(BinaryClassifier, self).__init__()
self.fc1 = nn.Linear(input_size, 64)
self.fc2 = nn.Linear(64, 32)
self.fc3 = nn.Linear(32, 16)
self.fc4 = nn.Linear(16, 1)
self.relu = nn.ReLU()
self.sigmoid = nn.Sigmoid()
self.dropout = nn.Dropout(p=0.2)
def forward(self, x):
x = self.relu(self.fc1(x))
x = self.relu(self.fc2(x))
x = self.relu(self.fc3(x))
x = self.dropout(x)
x = self.sigmoid(self.fc4(x))
return x
# Define hyperparameters
batch_size = 128
learning_rate = 0.001
# Convert data to PyTorch tensors and create DataLoader
X_train_tensor = torch.from_numpy(X_train).float()
X_valid_tensor = torch.from_numpy(X_validation).float()
X_test_tensor = torch.from_numpy(X_test).float()
y_train_tensor = torch.from_numpy(y_train).float()
y_valid_tensor = torch.from_numpy(y_validation).float()
y_test_tensor = torch.from_numpy(y_test).float()
train_ds = TensorDataset(X_train_tensor, y_train_tensor)
valid_ds = TensorDataset(X_valid_tensor, y_valid_tensor)
test_ds = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_ds, batch_size=batch_size)
test_loader = DataLoader(test_ds, batch_size=batch_size)
# Initialize the model and optimizer
model = BinaryClassifier(X_train.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model
BinaryClassifier(
  (fc1): Linear(in_features=199, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=16, bias=True)
  (fc4): Linear(in_features=16, out_features=1, bias=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
  (dropout): Dropout(p=0.2, inplace=False)
)
time1=time.time()
epochs = 20
# Train the model
try:
for epoch in range(epochs):
train_loss = 0
train_acc = 0
train_auc=0
valid_loss = 0
valid_acc = 0
valid_auc=0
pred=[]
epoch_label=[]
# Training loop
model.train()
count=0
for X, y in train_loader:
optimizer.zero_grad()
y_pred = model(X.float())
loss = criterion(y_pred.squeeze(), y)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Compute AUC from the raw probabilities; rounding first would collapse the ROC curve
train_auc += roc_auc_score(y.tolist(), y_pred.tolist())
y_pred=torch.round(y_pred)
train_acc += (y_pred.float() == y).float().mean().item()
count += 1
train_loss = np.round(train_loss/count, 3)
train_acc=np.round(train_acc/count,3)
train_auc=np.round(train_auc/count,3)
writer.add_scalar('Training Auc',train_auc,epoch)
writer.add_scalar('Training Loss',train_loss,epoch)
#Validation loop
model.eval()
count=0
with torch.no_grad():
for X, y in valid_loader:
y_pred = model(X)
loss = criterion(y_pred.squeeze(), y)
valid_loss += loss.item()
# Again, compute AUC on the raw probabilities before rounding
valid_auc += roc_auc_score(y.tolist(), y_pred.tolist())
y_pred=torch.round(y_pred)
valid_acc += (y_pred.float()== y).float().mean().item()
count += 1
valid_loss = np.round(valid_loss/count, 3)
valid_acc=np.round(valid_acc/count,3)
valid_auc=np.round(valid_auc/count,3)
writer.add_scalar('Validation Auc',valid_auc,epoch)
# Print the results for this epoch
print(f'Epoch {epoch + 1}, Train Loss: {train_loss}, '
f'Train AUC: {train_auc}, '
f'Valid Loss: {valid_loss}, '
f'Valid AUC: {valid_auc}')
total_time=time.time()-time1
#print((epoch,loss, accuracy, auc))
# Set the model to evaluation mode
model.eval()
y_pred = []
y_true = []
# Iterate over the test set and predict the probabilities and true labels
with torch.no_grad():
for inputs, labels in test_loader:
outputs = model(inputs)
y_pred.extend(outputs.squeeze())
y_true.extend(labels.numpy())
# Calculate the ROC AUC score
test_auc = roc_auc_score(y_true, y_pred)
except Exception:
# roc_auc_score raises ValueError on a batch that contains only one class; skip rather than abort
pass
Epoch 1, Train Loss: 0.273, Train AUC: 0.5, Valid Loss: 0.258, Valid AUC: 0.5
Epoch 2, Train Loss: 0.261, Train AUC: 0.504, Valid Loss: 0.256, Valid AUC: 0.505
Epoch 3, Train Loss: 0.257, Train AUC: 0.516, Valid Loss: 0.256, Valid AUC: 0.515
Epoch 4, Train Loss: 0.253, Train AUC: 0.529, Valid Loss: 0.25, Valid AUC: 0.529
Epoch 5, Train Loss: 0.249, Train AUC: 0.542, Valid Loss: 0.25, Valid AUC: 0.531
Epoch 6, Train Loss: 0.246, Train AUC: 0.545, Valid Loss: 0.248, Valid AUC: 0.536
Epoch 7, Train Loss: 0.244, Train AUC: 0.549, Valid Loss: 0.248, Valid AUC: 0.541
Epoch 8, Train Loss: 0.242, Train AUC: 0.555, Valid Loss: 0.251, Valid AUC: 0.545
Epoch 9, Train Loss: 0.241, Train AUC: 0.558, Valid Loss: 0.251, Valid AUC: 0.548
Epoch 10, Train Loss: 0.239, Train AUC: 0.562, Valid Loss: 0.25, Valid AUC: 0.547
Epoch 11, Train Loss: 0.237, Train AUC: 0.565, Valid Loss: 0.254, Valid AUC: 0.542
Epoch 12, Train Loss: 0.236, Train AUC: 0.567, Valid Loss: 0.253, Valid AUC: 0.553
Epoch 13, Train Loss: 0.235, Train AUC: 0.571, Valid Loss: 0.254, Valid AUC: 0.545
Epoch 14, Train Loss: 0.233, Train AUC: 0.573, Valid Loss: 0.254, Valid AUC: 0.548
Epoch 15, Train Loss: 0.232, Train AUC: 0.577, Valid Loss: 0.256, Valid AUC: 0.547
Epoch 16, Train Loss: 0.231, Train AUC: 0.577, Valid Loss: 0.259, Valid AUC: 0.551
Epoch 17, Train Loss: 0.229, Train AUC: 0.584, Valid Loss: 0.26, Valid AUC: 0.557
Epoch 18, Train Loss: 0.229, Train AUC: 0.585, Valid Loss: 0.261, Valid AUC: 0.558
Epoch 19, Train Loss: 0.227, Train AUC: 0.588, Valid Loss: 0.266, Valid AUC: 0.559
Epoch 20, Train Loss: 0.226, Train AUC: 0.592, Valid Loss: 0.261, Valid AUC: 0.552
# Create the experiment log only if it does not exist yet
try:
    hcdrLog
except NameError:
    hcdrLog = pd.DataFrame(
        columns=[
            "Architecture string",
            "Optimizer",
            "Epochs",
            "Train AUC",
            "Valid AUC",
            "Test AUC",
            "Total experiment time",
        ]
    )
hcdrLog
| Architecture string | Optimizer | Epochs | Train AUC | Valid AUC | Test AUC | Total experiment time |
|---|---|---|---|---|---|---|
hcdrLog.loc[len(hcdrLog)] = [
    '199-relu-64-relu-32-relu-16-dropout-sigmoid-1',
    "adam",
    f"{epochs}",
    f"{train_auc*100:.2f}%",
    f"{valid_auc*100:.2f}%",
    f"{test_auc*100:.2f}%",
    f"{total_time:8.2f} secs",
]
hcdrLog
| | Architecture string | Optimizer | Epochs | Train AUC | Valid AUC | Test AUC | Total experiment time |
|---|---|---|---|---|---|---|---|
| 0 | 199-relu-32-relu-16-relu-8-sigmoid-1 | adam | 50 | 76.20% | 67.10% | 73.15% | 487.11 secs |
| 1 | 199-relu-64-relu-32-relu-16-sigmoid-1 | adam | 50 | 62.80% | 55.40% | 71.84% | 433.81 secs |
| 2 | 199-relu-64-relu-32-relu-16-dropout-sigmoid-1 | adam | 10 | 56.30% | 55.00% | 76.73% | 203.19 secs |
%tensorboard --logdir=runs/exp4 --host=127.0.0.3

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
# Apply the already-fitted preprocessing pipeline to the Kaggle test set
# (transform, not fit_transform, so the test data does not refit the pipeline)
X_kaggle_test = data_prep_pipeline.transform(X_kaggle_test)
# Set the model to evaluation mode
model.eval()
# Convert the Kaggle test data to a tensor and move it to the device
X_kaggle_test_tensor = torch.from_numpy(X_kaggle_test).float()
# Generate predictions for the Kaggle test set
with torch.no_grad():
    kaggle_preds = model(X_kaggle_test_tensor)
# Extract the predicted probabilities for the positive class
kaggle_preds = kaggle_preds.numpy()
dataset=pd.read_csv("/content/drive/MyDrive/AML/application_test.csv")
# Submission dataframe (copy to avoid a SettingWithCopyWarning)
submit_df = dataset[['SK_ID_CURR']].copy()
submit_df['TARGET'] = kaggle_preds
submit_df.head()
| | SK_ID_CURR | TARGET |
|---|---|---|
| 0 | 100001 | 0.044793 |
| 1 | 100005 | 0.122536 |
| 2 | 100013 | 0.072631 |
| 3 | 100028 | 0.062897 |
| 4 | 100038 | 0.984631 |
submit_df.to_csv("MLP_submission5.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f MLP_submission5.csv -m "MLP_submission5"
100% 876k/876k [00:03<00:00, 254kB/s] Successfully submitted to Home Credit Default Risk
The objective of the Home Credit Default Risk project is to assess a client's ability to repay a loan based on a collection of features present in the dataset.
In Phase 1, we identified the problem as a classification task and selected suitable ML models. Phase 2 involved EDA and creating baseline models with four different algorithms. In Phase 3, we conducted additional feature engineering to identify the most effective features and used GridSearch to further refine our models' accuracy. We observed that XGBoost produced a Kaggle score of 0.736.
During this phase, our aim was to implement multilayer perceptron (MLP) architectures to analyze our data and evaluate performance using the cross-entropy loss function. We used the ReLU activation function with the Adam optimizer and trained for 50 epochs using 3 hidden layers and one output feature. Additionally, we used TensorBoard to visualize our accuracy and loss. As a result, we achieved a test ROC AUC of 0.76, with a Kaggle submission public score of 0.728.
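The MLP described above can be sketched in PyTorch. This is a hypothetical reconstruction from the architecture string `199-relu-64-relu-32-relu-16-dropout-sigmoid-1` reported in the experiment log; the dropout rate is an assumption, as it is not stated in the report.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of "199-relu-64-relu-32-relu-16-dropout-sigmoid-1";
# layer sizes follow the report, the dropout rate p=0.5 is an assumption.
model = nn.Sequential(
    nn.Linear(199, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Dropout(p=0.5),  # assumed rate; not stated in the report
    nn.Linear(16, 1), nn.Sigmoid(),
)

# Forward pass on a dummy batch of 4 rows with 199 features
probs = model(torch.randn(4, 199))
print(probs.shape)  # torch.Size([4, 1])
```

The sigmoid output layer keeps each prediction in [0, 1], so it can be submitted directly as the `TARGET` probability.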
Data description:

Task to be tackled:
We handled the following task in this phase:
Workflow Diagram:





Looking at the leaderboard, we observed that other groups implemented LGBM, which may have played a key role in their public Kaggle scores being higher than ours. They also added batch normalization layers to their MLP models, which could have helped as well. In addition, their initial number of features was larger than ours, and there may be differences in our preprocessing methodologies.
A visualization of the modeling pipelines and subpipelines:

Machine Learning Pipelines:
The model we aim to build should classify the target variable into two classes, 1 or 0, the numeric representation of whether the person who took out a loan will repay it, based on all the data provided in the dataset. To achieve accurate results, we plan to try the following algorithms and see which one provides the best outcome:
Gaussian Naive Bayes - A variant of the Naive Bayes algorithm for classification tasks in machine learning. It is based on Bayes' theorem and the assumption that the data features are independent. In Gaussian Naive Bayes, the input features are assumed to follow a Gaussian (normal) distribution. Given the input data, the algorithm computes the likelihood of each class and then chooses the class with the highest probability as the predicted output.
Decision Tree Classification - In this model, a tree-like model of decisions and their potential outcomes is built from the characteristics of the input data. The method begins with the complete dataset and chooses the best feature to split the data according to a criterion such as Gini impurity or information gain. The data is then divided into subsets based on the selected feature, and the procedure is repeated for each subset until all the data in a subset belong to the same class or a stopping criterion is satisfied.
Random Forest Classification - This model builds numerous decision trees using subsets of the input data, randomly choosing candidate features for each split. Each decision tree sees only a subset of the data, since the trees are trained on bootstrapped samples. The majority vote of the decision trees determines the algorithm's final output.
Neural Networks - Neural networks are a class of models that learn by adjusting the weights applied to the input features to generate a desired output. The input features are fed into a network of interconnected neurons, which transform the input through a series of mathematical operations governed by a set of weights learned during training. The output is then compared to the actual target, and the weights are updated by minimizing a loss function, such as cross-entropy loss.
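As a minimal sketch of how these candidate classifiers can be compared on held-out ROC AUC (using synthetic data and scikit-learn defaults here, not the HCDR tables reported below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the HCDR features: an imbalanced binary target
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "MLP": MLPClassifier(max_iter=300, random_state=42),
}
aucs = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # Rank validation rows by the predicted probability of the positive class
    aucs[name] = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
print(aucs)
```

ROC AUC is used rather than accuracy because, with roughly 92% of HCDR loans repaid, a classifier that always predicts "repaid" would already look accurate.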
Description of metrics and analysis:
The evaluation metrics that we want to use to check whether our model is performing well or not are as follows:
| Metric | Definition |
|---|---|
| True Positive (TP) | The number of rows that are correctly predicted as positive by the model. |
| True Negative (TN) | The number of rows that are correctly predicted as negative by the model. |
| False Positive (FP) | The number of rows that are incorrectly predicted as positive by the model. |
| False Negative (FN) | The number of rows that are incorrectly predicted as negative by the model. |
| index | exp_name | Train Acc | Valid Acc | Test Acc | Train AUC | Valid AUC | Test AUC | Train time | Test time | Best Params | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression-82 input features | 91.32% | 91.35% | 91.28% | 76.09% | 76.11% | 76.16% | 10.06 secs | 0.50 secs | {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False} | Joined Feature Aggregated in input features |
| 1 | Baseline Decision Tree-82 input features | 100.00% | 85.41% | 85.33% | 100.00% | 56.18% | 56.81% | 34.50 secs | 0.46 secs | {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 42, 'splitter': 'best'} | Joined Feature Aggregated in input features |
| 2 | Baseline Random Forest-82 input features | 100.00% | 91.91% | 91.94% | 100.00% | 74.77% | 74.38% | 157.52 secs | 3.05 secs | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False} | Joined Feature Aggregated in input features |
| 3 | Baseline GaussianNB-82 input features | 28.80% | 28.65% | 28.57% | 64.62% | 64.94% | 64.18% | 2.54 secs | 0.53 secs | {'priors': None, 'var_smoothing': 1e-09} | Joined Feature Aggregated in input features |
| 4 | Baseline XGBoost-82 input features | 92.77% | 91.93% | 91.95% | 89.09% | 77.70% | 77.70% | 199.03 secs | 0.58 secs | {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None} | Joined Feature Aggregated in input features |
| 5 | Tuned Logistic Regression-82 input features | 91.32% | 91.34% | 91.29% | 76.03% | 76.09% | 76.14% | 56.75 secs | 0.44 secs | {'C': 0.01, 'penalty': 'l2'} | Joined Feature Aggregated+Hyperparameters |
| 6 | Tuned Decision Tree-82 input features | 91.33% | 91.33% | 91.34% | 67.35% | 67.07% | 66.84% | 81.10 secs | 0.61 secs | {'max_depth': 2, 'min_samples_leaf': 2, 'min_samples_split': 2} | Joined Feature Aggregated+Hyperparameters |
| 7 | Tuned Random Forest-82 input features | 91.80% | 91.76% | 91.76% | 81.64% | 74.13% | 73.91% | 282.92 secs | 0.68 secs | {'max_depth': 10, 'min_samples_leaf': 2, 'n_estimators': 10} | Joined Feature Aggregated+Hyperparameters |
While building the MLP, we had the following discussion about its architecture before deciding on the best model.
In conclusion, our project focused on predicting the risk of loan defaults for Home Credit using machine learning techniques. This is an important problem as loan defaults can have a significant impact on financial institutions and their clients.
In phase 2, we conducted EDA to identify significant features and created ML pipelines using various classifiers. Underfitting was observed in Gaussian Naive Bayes and overfitting in Decision Tree and Random Forest. In the next phase, advanced feature engineering and hyperparameter tuning will be implemented to enhance model evaluation metrics.
In phase 3, we explored, cleaned, and engineered features for both categorical and numerical data, trained multiple models, and optimized their hyperparameters. Our best model achieved a ROC AUC score of 0.78 on the test data, outperforming the baseline. Important features for predicting loan defaults were past payment history, income, credit bureau reports, and external sources. In the future, we plan to implement a new multitask loss function in PyTorch to predict both the class and length of time before defaulting. Our findings and model could have practical implications for mitigating the risk of loan defaults in financial institutions.
In this phase, we built a neural network with 199 input features, 3 hidden layers (64, 32, and 16 neurons), and one output feature, using the ReLU activation function on the hidden layers, the sigmoid activation function on the output layer, and the Adam optimizer. The cross-entropy loss function was used for evaluation. The MLP model achieved a test ROC AUC score of 0.76 and a Kaggle submission public score of 0.728. However, the AUC scores of the non-deep-learning models (Logistic Regression, Random Forest, and XGBoost) were better. The MLP model could be improved using embeddings or advanced classification models such as boosting.


Read the following: